Introduction


Many years ago, I was working for an organization
transitioning from being a small start-up with a new
product to having been acquired by a large multinational
corporation that wanted to push the product everywhere.
As we started to expand, I found myself interacting with
many newly graduated new hires: minds with passion,
intellect, and energy, who sometimes struggled with what
they perceived as resistance to change or pressures they
had not been exposed to as students.


I found myself compelled to share the wisdom I had gained
from my experiences with these young professionals. I
believed that my insights could help them navigate the
challenges they were facing and enhance their
understanding of the field.


These individuals had many great ideas, but they often
lacked the practical perspective of inter-office politics,
customer expectations, and value-driven deliverables.
They had studied fascinating concepts in school and were
eager to implement them. This inspired me to find ways to
share a more practical perspective, one that could help
them turn their ideas into tangible solutions.


This book is a compilation of my observations and
experiences in Information Systems over the years. It
includes examples of non-optimal solutions that have
propelled us forward in our quest for better solutions and
insights into the intricate systems that shape our work
environment and influence our goal of creating efficient
systems.


Sometimes, they are just fun.


I wrote all of this hoping to help people who are just
entering the industry, whether they are recent grads or
managers trying to better understand the people around them.


It is probably worth noting that this is written by me, not a
computer. In the world of LLMs and generated content, I
tried to keep my best human storytelling voice. I did reach
out for help to AI, but I found it bland and sterile, so keep
in mind: chapter title pages are AI; stories are all me.


At the end of the day, remember that computers were built 
to serve us, not the other way around. 


Shit Disturbers 
Reinvent the Wheel 


Redesigning solutions to fit problems is necessary 


Identifying the existence of problems is not a 
failure 


Bringing solutions to problems is the point of The 
Art 


Automated testing of systems is my pet peeve: I think
every computer system should have a series of tests that
get run by another computer that tests every problem ever
thought of. My current customer has asked me to start
developing a system just like this for their record-keeping
and delivery system.


Currently, the customer has purchased a third-party tool
for automating control of the software. Unfortunately, the
tool sucks. It is very difficult for non-programmers to
understand its roundabout logic (it uses
screen-scraper-triggered events), and it has no mechanism
for managing large numbers of scripts (each one is
managed in and of itself). When building testing systems,
the tests themselves tend to be easy to create; managing
and tracking large numbers of tests becomes the problem.


Being a diligent consultant (alright, a diligent problem
solver), I suggested it was possible to build a custom tool
that wrapped the objects and was better able to be understood
by non-programmers, allowing programmers to more
easily manage a large number of tests.


Reinventing the Wheel


That's when I heard it:


There's no point reinventing the wheel


I take exception to this. I am encouraged to solve problems
in the office, and invention is the key to doing this. I
recognize that all problems have been solved; we already
have wheels. The only problem that ever exists is the need
to refine the general solution to the particular instance of
the problem; we need wheels suited to the current task.


If we had never reinvented the wheel, we would still be
driving around on Wagon Wheels. And I myself like having
soft rubber tires on my car.


In the end, we reinvent the wheel regularly; not every
wheel is perfect for every vehicle. Similarly, when
solving problems at the workplace designing
systems, it is sometimes necessary to build a custom
component that suits the needs of the problem.


While not a total reinvention, these components are a design
better suited to the problem at hand. To work around the
foibles of the existing technology, just because the
technology already exists, is the kind of short-sightedness
that leads to planes falling out of the sky.


Shit Disturber


Naturally, the moment I suggest all of this, I am accused of
being a Shit Disturber. But


When someone accuses me of being a
Shit Disturber, I know I'm on the right
track.


Shit Disturber. Let's break that term down: "shit" and
"disturber"; or a disturber of shit. For this to be true, there
must be shit to be disturbed.


That I am being accused of being a Shit Disturber forces my
audience to acknowledge that there is (in fact) shit present.


If there is shit present that has been ignored and avoided,
it may be more important to ask questions like, "When does
somebody intend to do something about this shit?" (This is
usually the hardest part of convincing people to change:
getting them to acknowledge that there is a problem which
requires fixing.)


Being a Disturber of Shit is not a bad thing.


Just because you are disturbing the shit does not mean you
put it there.


If the shit is in the middle of the road, we can either ignore
the shit or do something about it.


Naturally, this causes some discomfort: people have got
used to their path around the shit; while it is being moved,
the shit tends to stink; people have a hard enough time
cleaning their own shit (let alone someone else's); and the
person that put the shit there probably feels like shit for
not cleaning it up in the first place.


The Disturber is just the person willing to do something
about the problem. The fact of the matter is that we can
ignore problems for a long time, or put up with the
temporary discomfort of fixing them.


Conclusion 


Shit stinks and The Wheel turns; these are two truths of 
the world. Ignoring them does not make them go away. 


In life, we need to identify problems (shit), find solutions
(reinvent the wheel), and make the changes to enact those
solutions (disturb the shit).


In the past, I have been both punished and praised for
taking drastic action to solve drastic problems (often
regarding the same problem and by the same person).
While we may find change uncomfortable, we should never
turn away from these solutions.


So a tip of the hat to all those Shit Disturbers out there;
may you always keep finding ways to reinvent the wheel.


Fun with Markov
Network Brains


An introduction to evolutionary machine
learning


• Evolutionary machine learning algorithms are an
expression of Darwinian Evolution


• A simple in-browser demonstration of an
algorithm (Markov Network Brains) is
presented


• Creating for creation's sake: creating solutions and
simulations is an act of beauty


[Figure: over the course of about 1000 generations, Markov Network Brains
evolve from bugs (left) into something capable of finding food (right). The
bugs achieve this with no awareness of their environment beyond the [illegible].]
I was at my daughter's wedding, and (naturally) the
conversation turned toward the capabilities and
limitations of Artificial Intelligence.


I am not an AI expert, and my exposure to it has been
sketchy at best. I have implemented algorithms but have
never really done a deep dive into any of them to
understand the mechanics. So, when the conversation
brought up the concept of genetic or evolutionary
algorithms, we both had to confess we didn't know that
much about them.


What little we did discuss got me curious, and on the plane
ride home, I came across an article on Markov Network
Brains (MNB). When I first read about MNBs, something
very deep resonated with past agriculture and medical
experience, and I wanted to go deeper into the mechanics
behind them.


A gut feeling of familiarity was only enough to whet my
appetite; what I needed was a simple implementation that
I could step through to observe the changes as they
happened. I am a fan of browser-side JavaScript for
solving problems (there is always a compiler and debugger
handy), so I was fortunate to stumble across an
existing JavaScript implementation.


The Adami Lab: Markov Network Brains
http://adamilab.msu.edu/markov-network-brains/

Philip Neal: MNB JS. As of publication, Neal has
improved on his work, and his take on the problem results
in a different implementation style. Definitely worth
comparing.
https://github.com/pnealgit/mnb_js
My initial intention was simply to step through the code to
understand the process; however, after addressing a few
minor visual bugs, I found I had over-tinkered, leaving me
with some heavy lifting to get it to work. A deep tear-down
was necessary and resulted in a fun little simulation.


What better way to really understand what was going on?


A Layman's Description


Markov Network Brains are evolutionary algorithms based
on the modern models of genetics and evolution. The same
natural processes that allow bacteria to become
drug-resistant can be used to breed animals for a specific
purpose or to breed a computer program well suited to
solving a specific problem.


The point of any software algorithm is for the machine to
learn to solve a problem. In traditional programming, we
do this by having very clever humans write a computer
program that inspects some set of input values and creates
a new set of output values (technically, a function).


MNBs (really, Machine Learning algorithms in general) are
no different: we have a problem that needs solving, a
process for solving it, and we base it on some inputs. What
differentiates an MNB is that we do not directly create the
program; we allow it to be randomly generated, and slowly
bring it closer to solving the problem by automatically
testing small random changes. (Actually, when phrased
that way, it doesn't sound different at all.)


https://jefferey-cave.gitlab.io/mnb-js-demo/

Khan Academy: What is a function?
https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:functions/x2f8bb11595b61c86:evaluating-functions/v/what-is-a-function


Three interrelated but independent components are
central to understanding MNBs: Genome, Brain, and Breeder.
The ability of the algorithm to learn (become better able
to solve the problem) is tied to the way three components
work together:


1. Genome: this is like the programmer's un-compiled
code.


2. Brain: one could think of this as an executable,
compiled software. It also has memory allocated
for storing information.


3. Breeder: the developer, judging whether the code
is successful or not.


Like any software, these three parts are distinct but
strongly interrelated. Also, like any software, the exciting
parts happen at the transitions.


Extreme Programming: Iterations (1999). Defines an
iterative feedback approach to software project
management.
http://www.extremeprogramming.org/map/iteration.html
Genome Creation


The Genome is initially created as a random array
[genome.js:23]. Like a DNA genome, this list will be used
to build and recreate actors. In our case, it will be compiled
into The Brain.
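A random genome is easy to picture in code. The following is only a minimal sketch, not the code in genome.js: the function name `createGenome`, the genome length of 200, and the use of byte-sized values are all assumptions made for illustration.

```javascript
// Hypothetical helper (not taken from genome.js): build a genome as an
// array of random integers in the range 0..255.
function createGenome(length, rng = Math.random) {
  const genome = new Array(length);
  for (let i = 0; i < length; i++) {
    // Each "gene" is one random byte; on its own it means nothing.
    genome[i] = Math.floor(rng() * 256);
  }
  return genome;
}

// The length of 200 is an arbitrary choice for this illustration.
const genome = createGenome(200);
console.log(genome.slice(0, 8));
```

The genome carries no meaning yet; meaning only appears once something interprets it.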


Genome > Brain 


A Genome is compiled into a Brain by reading the Genome
and using the data as the basis for allocating a quantity of
memory, initializing the memory values, and allocating
logic-gate transforms: "or", "and", or "xor" (other
transforms are possible; use your imagination) [gates.js:5].
Lastly, the Genome creates transforms that map the memory
elements as inputs and outputs [brain.js:104].
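To make the compile step concrete, here is a rough sketch under stated assumptions. The encoding is invented for illustration (first gene sets the memory size, then every four genes describe one gate), memory is simply zeroed, and the names `compileBrain` and `step` are hypothetical; the real brain.js and gates.js may do all of this differently.

```javascript
// The three gate types named in the text, acting on single bits.
const GATES = {
  and: (a, b) => a & b,
  or: (a, b) => a | b,
  xor: (a, b) => a ^ b,
};
const GATE_NAMES = Object.keys(GATES);

// Assumed encoding: gene 0 picks the memory size (4..15 cells); each
// following group of four genes is [gate type, input A, input B, output].
function compileBrain(genome) {
  const memorySize = 4 + (genome[0] % 12);
  const memory = new Array(memorySize).fill(0);
  const transforms = [];
  for (let i = 1; i + 3 < genome.length; i += 4) {
    transforms.push({
      gate: GATE_NAMES[genome[i] % GATE_NAMES.length],
      inA: genome[i + 1] % memorySize,
      inB: genome[i + 2] % memorySize,
      out: genome[i + 3] % memorySize,
    });
  }
  return { memory, transforms };
}

// One tick of the brain: every gate reads the old memory state and
// writes its result into the next state.
function step(brain) {
  const next = brain.memory.slice();
  for (const t of brain.transforms) {
    next[t.out] = GATES[t.gate](brain.memory[t.inA], brain.memory[t.inB]);
  }
  brain.memory = next;
}

const brain = compileBrain([0, 1, 0, 1, 2]); // one "or" gate: cells 0,1 -> 2
brain.memory[0] = 1; // pretend a sensor fired
step(brain);
console.log(brain.memory);
```

Reading the old state and writing a fresh one keeps the gates from interfering with each other within a single tick; that is a design choice of this sketch, not a claim about the original.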


Remember that these values were randomly generated, so
(at least on the first pass) these transforms and the
quantity of memory are selected randomly. Basically, you
have generated a completely random program acting on
random memory elements.


An infinite number of monkeys typing on an infinite
number of typewriters.


Brain > Breeding 


Once The Brain has executed, it will have generated some
outputs. It is up to The Breeder to judge whether the
outputs were of any value or not, or more importantly,
which of these executions were most valuable. In a more
complex environment, it is reasonable for The Breeder to
observe The Brain in action.


While simple logic gates have been implemented,
Hintzelab describes several different types of transforms
that can be used and are useful
https://github.com/Hintzelab/MABE/wiki/Brain-Markov


The score distribution of the bugs starts at 0.02±0.02 and improves to
0.25±0.01 after 28 generations. A plateau was reached at about generation 17.


To achieve this, we need to run several different Brains
many times. Most of these runs will be useless, but some
will be useful. Like an animal breeder, we can select the
most valuable genomes and use them as the basis for
better genomes, discarding the rest [evolve.js:55].
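Selection itself can be very small. The sketch below assumes each candidate is an object with a numeric `score` and a `genome`; the name `selectBest` and the object shape are hypothetical, not taken from evolve.js.

```javascript
// Hypothetical selection step: keep the highest-scoring candidates and
// discard the rest. The input array is left untouched.
function selectBest(population, keep) {
  return [...population]
    .sort((a, b) => b.score - a.score) // best first
    .slice(0, keep);
}

const population = [
  { genome: [1, 2, 3], score: 0.02 },
  { genome: [4, 5, 6], score: 0.25 },
  { genome: [7, 8, 9], score: 0.1 },
];
const survivors = selectBest(population, 2);
console.log(survivors.map((s) => s.score));
```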


Breeding > Genome 


Once The Breeder has selected the most successful
programs, these programs can be used as the basis for
trying new variations.


This is done through a reproduction process where
genomes are randomly intermixed with one another
(sexual reproduction) to produce a new algorithm that has
a new mix of decision-making processes [evolve.js:03].


The key is that each of these newly created genomes is a
little less random than its predecessors; suitable
decision-making structures are kept, and bad ones are
culled. Genomes identified as good are recombined with
one another to see if they result in something even better.
Over time, this process will result in progressively
improving programs that get closer to solving the
problem.
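One simple way to intermix two genomes is to pick each gene from one parent or the other at random. This is only a sketch of that idea (often called uniform crossover); the name `breed` is hypothetical, and the actual mixing strategy in evolve.js may differ.

```javascript
// Hypothetical breeding step: each gene of the child is copied from the
// mother or the father with equal probability.
function breed(mother, father, rng = Math.random) {
  const length = Math.min(mother.length, father.length);
  const child = new Array(length);
  for (let i = 0; i < length; i++) {
    child[i] = rng() < 0.5 ? mother[i] : father[i];
  }
  return child;
}

const mother = [10, 20, 30, 40];
const father = [11, 21, 31, 41];
const child = breed(mother, father);
console.log(child);
```

Passing the random source in as a parameter makes a run repeatable, which helps when trying to tell luck from defects.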


A second plateau was reached around generation 120, before the
last improvement at around generation 230 (left), where it reached a
score of 0.85±0.02. Due to luck, a different run (right) took nearly 10
times longer to discover the second plateau.
Suggestions


As with any system that has randomness involved,
debugging can be painful. It is hard to tell whether an
observed behaviour results from luck (good or bad) or
faulty programming.


The problem is most observable with bugs being randomly
placed on top of food and then doing nothing. These bugs
receive high rewards purely based on luck. Bugs actively
moving in search of food end up being culled for not being
as successful. This element of luck is undesirable.


A lucky bug: the blue bug has no neural activity but was randomly
populated right on top of a food source.


As a result of pondering this problematic element of luck,
two additions were made that are worth noting:
the culling of useless brains and the reuse of successful
genetics. Both of these are based on animal husbandry
practices.


Early Culling 


If the randomly generated program does not result in any
output, it is useless to us.


The brains created are constrained to an array of
approximately 50 elements of memory. Given 3 outputs,
there is only a 6% chance that randomly generated
programs will result in meaningful output. Larger memory
will make these odds even worse.


To overcome this, I created a check in the generation
routine. Immediately after creating a new brain, its
transforms are scanned to determine if it will take any
action [evolve.js:161]. If none of the transforms in the
brain ever write to the output segment, the bug is
immediately discarded, and a new bug is generated in its
place [evolve.js:87].
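The check described above might look something like the following. It is a sketch under assumptions: a brain is taken to be an object with a `transforms` array whose entries have an `out` index, and the output segment is assumed to be the tail of memory starting at `outputStart`. The names `writesToOutput` and `createViableBrain` are hypothetical.

```javascript
// True if at least one transform writes into the output segment.
function writesToOutput(brain, outputStart) {
  return brain.transforms.some((t) => t.out >= outputStart);
}

// Keep generating brains until one can actually act. maxAttempts is a
// safety net added for this sketch so a bad setup cannot loop forever.
function createViableBrain(makeBrain, outputStart, maxAttempts = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const brain = makeBrain();
    if (writesToOutput(brain, outputStart)) return brain;
  }
  throw new Error("no viable brain found");
}

// Example: 50 memory cells, with the last 3 (47, 48, 49) as outputs.
const randomBrain = () => ({
  transforms: [{ out: Math.floor(Math.random() * 50) }],
});
const viable = createViableBrain(randomBrain, 47);
console.log(viable.transforms[0].out);
```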


Karma 


While investigating the element of luck, it occurred to me
that animal breeders keep track of their most successful
breeding stock: animals with good parentage are likely to
produce better offspring than animals with poor
parentage. To simulate this, I introduced the concept of
karma. Karma is a score attached to the genome rather
than the bug itself. It is calculated at the end of a cycle by
taking the bug's score and averaging it with its genome's
score [evolve.js:32]. Newly created genomes inherit their
predecessors' karmic score by averaging the score of the
parent genomes [evolve.js:101].
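Both karma rules are plain averages, so a sketch is short. The function names here are hypothetical and the example numbers are made up; only the averaging itself comes from the description above.

```javascript
// End of a cycle: blend the bug's score into its genome's karma.
function updateKarma(genomeKarma, bugScore) {
  return (genomeKarma + bugScore) / 2;
}

// A new genome starts with the average of its parents' karma.
function inheritKarma(motherKarma, fatherKarma) {
  return (motherKarma + fatherKarma) / 2;
}

// A genome with good karma survives one bad generation, but repeated
// failure drags the karma steadily downward.
let karma = 1;
karma = updateKarma(karma, 0); // one lousy run
const afterOneBadRun = karma;
karma = updateKarma(karma, 0); // and another
const afterTwoBadRuns = karma;
console.log(afterOneBadRun, afterTwoBadRuns);
```

Because each update halves the distance to the latest score, recent cycles count for more than old ones; that follows from the averaging rule as described, and is worth keeping in mind when reading the scores.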


Each brain is monitored for activity to help distinguish activity and
decision-making from luck. From left to right: the karma and bug
score, sensors, memory state, and outputs (speed change, lean left,
lean right).


When it comes time to compare the genomes for
effectiveness, karma is used. Evaluating the overall
genome rather than the bug itself allows unlucky genomes
to get another chance to prove themselves. Continued lack
of success will result in karma slowly declining (eventually
resulting in a cull), while a single lousy generation will not
cause an otherwise successful genome to be lost.


Unfortunately, I have no clear evidence that either of these
were useful or effective, as these were introduced to
compensate for what turned out to be a defect in the brain
processing itself. While logically sound, it was developed
due to a long delay in the first generation creation,
believed to be caused by thousands of bugs being rejected.
In hindsight, it was determined that this was not the
actual cause, so there is no way to know if it made any
difference.


Conclusion 


At some point in my reading, I came across a statement
that evolutionary programming has low value because
similar results can be achieved faster using other
techniques (there is a counter-argument that they can
discover solutions humans cannot consider). This may be
true, but I had a lot of fun building this simulation and am
Sitalpietnene e 


Wilt strani enjoy ic ibid a aes 
Sgaicsts ta yrct tvaved that cetanes ost 
foes md een expeiaton 

coal sco wong ears tion 

cee ny and ees Pay thi 
rahe eee jut ay stot ockoret ee 


From a philosophical standpoint, genuinely understanding
the evolutionary processes involved in this algorithm has
given me a new perspective on complex, self-forming
systems. From interpersonal relationships in the office to
black-market economics, to the way students learn, to
political alliances, I now see them slightly differently than I
did.


It was interesting to watch my own biases in
decision-making. My original hypothesis was that the
bugs would evolve a spiral search pattern, and what I
perceived as defects in their behaviour led me to increase
punishment for touching the boundary to force them to
conform. In the end, I was surprised to wake up one
morning and discover the bugs had evolved a pattern of
ignoring the pain and using the wall to orient themselves
in their environment. My own little Stanford Experiment.


Lastly, this project has been a reminder of how useful it is
to build throw-away programs. When learning something
new, creating a small piece of code is better than tackling a
giant problem. A small, simplified model can be held in
your head while you learn. When your organization
requires a reliable solution, reach for a battle-tested
library; build from scratch when you want to understand.


https://en.wikipedia.org/wiki/Evolved_antenna
To quote Feynman:


What I cannot create, I do not understand
~ Feynman


Deep understanding is a funny thing; sometimes, you
come to learn that the battle-tested library isn't as helpful
as you thought.


Your Next Steps 


If you are interested in Markov Network Brains, you
should


1. Open the simulation


2. Press F12


3. Put a break-point somewhere in the code


4. Start stepping through it


Stepping through running code and observing the changes
is the best way to learn about a program's behaviour.


Ironically? Poetically? I found that quote via Chris
Adami's blog while searching for the references for this
article. Adami is the guy who started this whole mess for me.
https://adamilab.blogspot.com/2013/02/your-conscious-you.html

https://jefferey-cave.gitlab.io/mnb-js-demo/
Fork and Fix


A co-worker was asking me about this program and
started suggesting all kinds of great ideas I could
implement to make it more interesting. I suddenly realized
I had taken everything I wanted from this little toy. I'm
going to move on to other projects. Instead, I suggested he
should make the changes!


Fork the project from the same point I did, and build it
yourself. Or fork my version and make a cool
modification... either way, I would love to see what you
come up with.


• Change the physics (collisions, spherical world, ...)


The exact point that was forked from has a more direct
approach
https://gitlab.com/jefferey-cave/mnb-js-demo/tree/newe
¥_thinkwww/js

Modifications I made include a more object-oriented
approach
https://gitlab.com/jefferey-cave/mnb-js-demo/tree/mast
expos


• Make the bugs aware of one another (watch
competitive behaviour evolve? cooperative
behaviour?)


• Find a more challenging problem for them to solve
(randomize the food, introduce fight and flight to
make them prey, ...)


• Put the brain in a separate thread (someone
please do this) or on the GPU.


• Probabilistic Logic vs Binary


• Trust me: the list could go on forever


Further Reading 


You could also read more about Markov Network Brains
from people who actually know what they are talking
about. I have taken some liberties with the metaphors I
have used, and learning the shared metaphors and
terminology would also be helpful.


• Adami Labs: The original article that I found. Also
has a battle-tested C++ library available


Mozilla Developer Documentation, WebWorkers
https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers

WebGL Fundamentals, offers a series of tutorials on how
to use GPUs from the browser
https://webglfundamentals.org/

The Adami Lab, Markov Network Brains
https://adamilab.msu.edu/markov-network-brains/


• Brain.js: A battle-tested JS library that
implements MNBs as one of its models


• Wikipedia: Markov Logic Network


• Modular Agent Based Evolution Framework:
MABE (python) offers an interesting framework for
defining all the parts (battle-tested)


• Reddit: A Detailed Criticism of Evolutionary
Algorithms/Computation by Programmers?

Brain.js: GPU accelerated Neural networks in JavaScript,
for Browsers and Node.js
https://brain.js.org/

https://en.wikipedia.org/wiki/Markov_logic_network

https://github.com/Hintzelab/MABE/wiki/MABE-framework

https://www.reddit.com/r/compsci/comments/4hwui7/deleted_by_user/
Build a Chromolabe


Understanding algorithms and automated
decision making, using some coloured
pencils and paper


• Computational algorithms represent automated
decision-making tools


• Distinct colour selection is a common problem in
data visualization


• A mechanical device is demonstrated to show how
a machine can make decisions


• Suitable for ages 12 to 120
[Figure: examples of colour used to categorise: classes from my personal
calendar, a choropleth map, a directed graph, a Gantt-style chart, and a
knockoff of a popular video game.]
Most people rely on their natural human intuition for
decision-making, and human intuition is based on
interacting with the physical world around us. Computers,
on the other hand, deal in abstract ideas, something
humans just aren't well equipped to deal with.


What goes on inside a computer's mind is a mystery to
most people. It is scary: you can't see, touch, or taste it.
This makes it hard to understand what programmers are
doing and, therefore, can be intimidating for people.


ac st crs ical i ii 
of people witha tishinackformanpuatngabrart GR 
Sabet ati vie aceite to hele weg a oo] 
Seieec pocwnalds pcedongae 


Computer programmers are often faced with the daunting 
task of taking those abstract concepts and making them 
concrete for their audience, a tricky and detailed task. 


One of the tools used to do this is colour. 


Colour is often used to visualise the categorisation of
ideas: time blocks in a calendar, political parties,
characters in a game. We use colour everywhere to group
related things together.


As programmers, we have just encountered our first
problem: how do we choose the colour palette we will use
in our program?


Kill Math, Bret Victor, 2011-04-11
https://worrydream.com/KillMath
Good Palettes


There are three components to selecting a good colour
palette for use in visualizations:


1. High Differentiation
2. Sufficient Colours
3. Aesthetically Pleasing


Every colour palette used in a visualization should
maintain these three essential elements.


Sufficient Colours 


There should be enough colours to meet the needs of the
visualisation.


In the early days of video games, it was often enough to
have different colours: one for each player. In
cartography, the number of colours needed is dictated by
the number of shared borders; the same colour should
never touch.


Sufficient is different depending on what you are trying to
represent.


High Differentiation 


The colours used should be sufficiently different so people
can tell them apart. If a driver tries to tell the difference
between "stop" and "go" at a traffic intersection, there
should be no ambiguity in which colour is being seen by
the audience.


Finding food in the wilderness is a matter of survival for
hunter/gatherer humans. The differentiation of colour is built in to
humans. [image: Wikimedia]


This problem is complicated by internationalization
(colour means different things in different cultures),
biological issues (colour blindness), and many other issues
(this isn't a simple process).


Aesthetically Pleasing


At the end of the day, humans are the ones who will be
looking at this visualization; they should find it pleasing to
look at.


Choosing complementary colours can often have bizarre
visual consequences. We need to work with the way
humans are built. There is also a fashionable element to
this; colours that are popular today may not be popular
tomorrow.


Complementary colours have high contrast but, when placed
side-by-side, can have negative consequences. Interestingly, the
pastel variants do not suffer these consequences.


Infinite Colours to Choose 


Choosing colours is difficult. 


There are an infinite number of colours to choose from,
but we need to select just 4-12 colours, which need to meet
our three requirements.


Complex decision-making is exactly what we build
machines to help us with. To start building our
information machine, we need to first organize everything
we know; all we know so far is a fundamental truth:


There are a lot of colours to choose from


To help us sort through the options, we need some way of
organizing them. That is a fundamental thing
programmers do; they organize stuff. By organizing
things, we can simplify our problems.


Moses Harris, The Natural System of Colours (1776) [Wikipedia: Color
Wheel]. Colour wheels have long been a tool for grouping similar colours.


So we first need to narrow the set down; let's only pick
from colours humans can see. This process can be
represented with a Colour Wheel.


Colour wheels represent the variations of colours that can
be made by mixing different base colours together. The
base colours are placed around the outside.


While not representative of every colour possible, these
colour wheels give us many colours. They also have the
advantage of placing similar colours physically close
together.


This helps us satisfy two of our requirements:

1. Sufficient: if we need many, there are many.

2. Differentiation: the further apart they are physically, the
more different they are.


Mechanically Making
Decisions


Narrowing down an infinite number of colours to just the
4-6 we need is complex, and building a machine to choose
for us is the point of the exercise. Unfortunately,
describing the machine and how it works mathematically
sounds like a bunch of abstract symbol manipulation.


Rather than trying to explain it, let the demonstration be
its own proof. Besides, this machine is really simple


like... really simple.


Building the machine is way easier than explaining it. Once
you make it, how the colour picker works should become
evident.


1. Gather your supplies


Get a pencil; sharper is better. Also, make a colour wheel,
and trace out a blank circle on a separate sheet of paper.


2. Draw a Spiral


Using the blank circle, starting anywhere along the
outside, begin tracing it gently with your pencil. When you
have a feel for the circle, draw an evenly-spaced spiral
toward the middle.


3. Do the calculation


Place your pencil on the topmost tick-mark of the circle.
Now, move your pencil three steps, and mark the position:

1. move across the circle, to the opposite side
2. move over one tick-mark (either way)
3. move in one step toward the middle
4. mark that position


Now, starting from this new position,

1. move across to the opposite side
2. move over one tick-mark (same direction)
3. move in two steps toward the middle
4. mark that position


Did you see it? The slight change in the repeating pattern?


Keep moving around, marking points on the chart until
you reach the middle.


[Figure: one of the reasons we make machines is that they repeat actions for us.]
4. Make it permanent


When you think you have all your dots in the correct
location, carefully...


(make sure you are careful: this is both delicate and
dangerous)


place your pencil over the first dot and...

(don't mess it up: we are building computers here)

Ram that pencil through!


Seriously, don't be gentle. Punch a good-sized hole in the
paper.


(just be careful not to put your hand in the way)


Do that for each of the dots you made.


Reading the Results 


Align your spiral graph with the colour wheel, and watch
your colour palette reveal itself. Start from the first dot
and copy down each colour your program chose.


That is your unique colour palette to use in your
visualizations. 


Remember when we started, we were looking for a colour
palette that satisfied three criteria:


1. High Differentiation


2. Sufficient Colours


3. Aesthetically Pleasing


High Differentiation


By starting with a colour wheel, we locate similar colours 
physically close together. Our algorithm then travels far 
across to select physically far apart colours. 


Sufficient Colours 


The spiral pattern ensures that we never select the same
colour twice. No matter how many colours we choose, we
always have different colours.

You may notice that the differentiation of colours
decreases as you approach the middle: the more colours
you pick, the more similar they get. This is a compromise
we need to make. It is a balance between having a lot and
having them be different.


Aesthetically Pleasing 


This is always tough; what is pleasing to me may not be
pleasing to you.

There is no way to have our simple computer decide
whether the selected colour is pretty or not. That is
something best left to humans. So take a good look at your
colours, and determine if you like them. You may have
gotten a good starting point by luck, but maybe you did
not.

One last thing you should try: poke a small hole right in
the middle.

Now, you can rotate the selector wheel around the colour
wheel to fine-tune your selection process.

Conclusion

Don't just read about it, do it.

This is a simple activity that lets a person feel how
programming works. Have a younger sibling? Do it with
them, and watch how they see the activity come to life,
because that is part of programming, too: helping people
see the value of automated decision-making.


Variations 


I recommend physically making your own colour wheel
with pencil crayons or watercolour. You could draw a
colour wheel programmatically, but all of it should be
hand-drawn to maximize comprehension.


A natural extension of this is to write a function
(programmatic or algebraic) to pick the colour. This would
require you to define the colour wheel space and traverse
the space in a spiral. HTML colour codes based on RGB
make a good return value.
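
One possible sketch of such a function follows. It assumes a colour wheel where the angle is the hue and the distance from the centre is the saturation, which may not match the wheel you drew; `spiralPalette`, `hslToHex`, and the `ticks` parameter are names invented for this illustration. Each pick repeats the paper steps: jump across the wheel, shift over one tick-mark, and move a little closer to the middle.

```javascript
// Convert HSL (h in degrees, s and l in 0..1) to an HTML hex colour code.
function hslToHex(h, s, l) {
  const a = s * Math.min(l, 1 - l);
  const channel = (n) => {
    const k = (n + h / 30) % 12;
    const value = l - a * Math.max(-1, Math.min(k - 3, 9 - k, 1));
    return Math.round(value * 255).toString(16).padStart(2, "0");
  };
  return "#" + channel(0) + channel(8) + channel(4);
}

// Pick `count` colours by spiralling in from the rim of the wheel.
// Assumption: angle = hue, radius = saturation, lightness fixed at 50%.
function spiralPalette(count, ticks = 12) {
  const tick = 360 / ticks;           // size of one tick-mark in degrees
  const colours = [];
  let angle = 0;                      // start at the topmost tick-mark
  for (let i = 0; i < count; i++) {
    // Move inward a little on every pick, stopping short of the grey centre.
    const radius = 1 - (i / count) * 0.75;
    colours.push(hslToHex(angle, radius, 0.5));
    // Across to the opposite side, then over one tick-mark.
    angle = (angle + 180 + tick) % 360;
  }
  return colours;
}

console.log(spiralPalette(6));
```

As on paper, the later colours sit closer to the middle and are therefore less saturated and less distinct from each other.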


If you want a challenge, move into a more complete
3-dimensional colour space. RGB defines three
dimensions: a function that selects from a colour sphere,
with a 3-dimensional spiral search of the space... even I'm
not sure how.


Further Reading 


While this represents one way to select a colour palette, it
is not necessarily the best. Have a look at some other
considerations:

Color Wheel Pro: Color Theory Basics

Graphiq: Finding the Right Color Palettes for Data
Visualizations

A color palette optimized for data visualization


When I came up with the idea for constructing this
contraption, I was thinking about the

Astrolabe: a computer from the first
century, developed in Alexandria.


If you are looking for a digital variation of this tool


Though I suspect building it yourself would be more fun.

Using WebGL to Solve a
Practical Problem

An introduction for Dummy Programmers
(using the Smith-Waterman Algorithm)

• An explanation of why GPUs can significantly
increase processing speed

• A simple demonstration of a GPU implementation

Complete Code @ codesandbox

Some time ago, I was teaching introductory Python and
basic browser programming. During this time, I wrote an
application that compares pieces of software code and
presents their similarity in a force-directed and tornado
diagram. I ran this software semi-regularly (weekly), and
a significant problem appeared with my solution very
early on.

It took a really long time to solve. A really long time. A
painfully long time.

I needed to find a way to speed up the processing.


The points are student submissions and the lines represent the level of
similarity. With 36 comparisons, this takes long enough that I get
bored.


When I wrote the tool, GPU processing was hot, and
everyone was talking about how this would speed up
everything. No matter what the question, GPU was the
answer. This was an obvious avenue of investigation.
However, I had decided to write this tool in the browser
(legal and ethical constraints), and browsers do not have
direct access to the underlying hardware.

So, if all the cool kids are using GPUs, and this is written in
the browser, I'm intensely curious... it looks like I'm
learning WebGL and GLSL.

https://jefferey-cave.gitlab.io/miss/


Pre-Requisites 


Before beginning, you should be comfortable with
programming. The demonstration is written in vanilla
browser JavaScript, so no particularly advanced
techniques are used; however, using WebGL requires
switching between two languages and compiling code.
Web programming does not usually involve those things.

The only programmatic technique you should be vaguely
familiar with is the cellular automaton. Conway's
Game of Life is the classic example of this. GoL has been a
staple of programming instructors for 50 years because
the problem is fairly simple, yet the solution is complex
enough to exercise student skills.

In addition, I strongly recommend going to the local office
supply store and buying a cheap pencil, eraser, and pad of
grid paper. Nothing builds understanding like working
through problems yourself.


MDN Web Docs, GLSL Shaders
https://developer.mozilla.org/en-US/docs/Games/Techniques/3D_on_the_web/GLSL_shaders

How GPUs Speed Up
Processing

GPUs are a completely distinct mechanism from CPUs.
CPUs are designed to offer many operations and allow you
to run them one at a time. GPUs offer fewer operations, but
set them up so you can run a bunch of them simultaneously
(parallel processing).


This comes at a couple of different costs to us
programmers.

1. It's like working on a different computer.

2. The instructions we write for one don't necessarily
exist for the other.

That's annoying, but... parallel processing: as long as they
all run the same set of instructions, you can run a
calculation a couple of thousand times, except
simultaneously. Very simply put:

GPUs do parallel processing of a single function.

Technically, the function is called a "kernel"; I referred to
it as a program in my code.


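Consider the following function:

// Allocate2DArray is a stand-in helper: a size x size array of zeros
function Allocate2DArray(size){
  return Array.from({length: size}, () => new Array(size).fill(0));
}

function MultiplicationTable(size=10){
  let table = Allocate2DArray(size);
  let memLoc = {x: 0, y: 0};
  // visit every memory location, one after the other
  for(memLoc.x=0; memLoc.x < table.length; memLoc.x++){
    for(memLoc.y=0; memLoc.y < table.length; memLoc.y++){
      table[memLoc.x][memLoc.y] = memLoc.x * memLoc.y;
    }
  }
  return table;
}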


Parallel processing on the GPU is about doing the same 
action simultaneously. In this case, the multiplication is a 
process that is consistently the same. 


We'll do some basic math: a 10 x 10 array costs us 100 units
of processing time.

Now consider doing the same processing using the GPU:

function MultiplicationTable(size=10){
  let table = Allocate2DArray(size);
  table = gpu(table)
    .forEach((memLoc)=>{return memLoc.x * memLoc.y;});
  return table;
}


That forEach costs one (1) unit of processing time, no
matter whether it is 10x10 or 10000x10000.

I made that code up. It won't work, but it does give you
some idea of what we are trying to work toward. No matter
how big we make `table`, it will take 1 unit of processing
time.
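
To make the idea a little more concrete, here is a sketch that does run, using the CPU as a stand-in for the GPU. `runKernel` is a name invented for this illustration. The point is the shape of the kernel: it only ever sees its own memory location, which is what would let a GPU run every call at the same time (in the idealized picture above, for one unit of time).

```javascript
// CPU stand-in for a GPU: apply the same kernel to every memory location.
// A real GPU would run these calls simultaneously; here they run in a loop.
function runKernel(size, kernel) {
  const table = [];
  for (let x = 0; x < size; x++) {
    table.push([]);
    for (let y = 0; y < size; y++) {
      // The kernel receives only its own location, never the loop state.
      table[x].push(kernel({ x, y }));
    }
  }
  return table;
}

// The kernel: one small function, identical for every cell.
const multiply = (memLoc) => memLoc.x * memLoc.y;

const table = runKernel(10, multiply);
console.log(table[3][4]); // 12
```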


Using GPUs 


GPUs are mechanically different from CPUs. 


Because of this mechanical difference, it is helpful to think
of GPUs as a completely separate computer you are
attached to. Not only do you have an individual processing
unit (GPU instead of CPU), but it also uses separate
memory and a separate instruction set.

These three elements of separation mean there are three
primary phases that we need to go through to make use of
them:


1. Send the instructions to the GPU space
(compilation)

2. Exchange memory with the GPU space (transfer:
read/write)

3. Execute the instructions (execution)

Managing these three phases is WebGL's most challenging
and complex part. There are a significant number of
details that need to be handled just to exchange
information with the other space.


Part of this is because the 'G' in GPU and WebGL stands for
Graphics. We are using something designed for
manipulating images to do general computation. The
details that need to be managed revolve around defining
elements of an image; this means we need to describe our
raw numbers in terms of an image.

This is simplified by creating helper functions that
describe our data.


psGPU

A helper class was set up in the demonstration called
psGPU [webgl.html:188]. It has a few functions that
abstract much of the configuration away:

addProgram
Compiles and sends a block of GLSL code (as a string) to
the GPU space. [webgl.html:284-456]

initMemory
Creates a hidden "image" that will act as our processing
memory. [webgl.html:289-246]

write
Transfers our memory (Uint8Array) over to the GPU
space. [webgl.html:254-263]

read
Transfers our memory back from the GPU space.
[webgl.html:241-252]

run
Executes the program we compiled. [webgl.html:266-287]

As a novice, these functions became critical to setting up
the GPU. I was intensely interested in implementing an
algorithm, and the complexities of memory management
significantly distracted from the complexity I wanted to
focus on.

pixel

At some point, a second helper class was created called
pixel. [webgl.html:96]

The memory interchanged through read and write consists
of a byte array. Interpreting an image as a byte array calls
for a few more helpers. In particular, each image pixel is
interpreted as 4 bytes, representing the pixel's Red, Green,
Blue, and Alpha (rgba) values. Within the GPU, these
values are defined by the type vec4, a collection of 4 values
(rgba).

The pixel class was created to help maintain consistent
naming across the CPU/GPU boundary. It is just a
convenience for mapping the returned Uint8Array to the
4 bytes representing a given pixel, allowing them to be
referred to by the same rgba names.
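
The mapping can be sketched in a few lines. This is not the demonstration's pixel class; `PixelView` and its methods are hypothetical names, and the row-by-row layout of the byte array is an assumption made for the illustration.

```javascript
// Hypothetical helper: read and write one pixel (4 bytes) of an RGBA byte array.
// Assumes pixels are stored row by row, 4 bytes each, in r, g, b, a order.
class PixelView {
  constructor(bytes, width) {
    this.bytes = bytes;
    this.width = width;
  }
  // Index of the first byte (red) of the pixel at column x, row y.
  offset(x, y) {
    return (y * this.width + x) * 4;
  }
  get(x, y) {
    const o = this.offset(x, y);
    const [r, g, b, a] = this.bytes.slice(o, o + 4);
    return { r, g, b, a };
  }
  set(x, y, { r = 0, g = 0, b = 0, a = 0 }) {
    this.bytes.set([r, g, b, a], this.offset(x, y));
  }
}

const memory = new Uint8Array(3 * 2 * 4); // a 3 x 2 "image"
const view = new PixelView(memory, 3);
view.set(1, 1, { r: 2, a: 7 });
console.log(view.get(1, 1)); // { r: 2, g: 0, b: 0, a: 7 }
```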


MDN Web Docs, Color, RGBA
https://developer.mozilla.org/en-US/docs/Web/CSS/color_value#rgba%28%29

DrawGrid

The most interesting (maybe "useful" is a better word)
utility function is DrawGrid. [webgl.html:828-283]

Because GPUs are designed to manage images, the only
inspection of memory changes available is by looking at a
picture. Since the purpose of this project has nothing to do
with images, colour is not a meaningful representation.
This makes debugging trickier.

To help, DrawGrid simply renders each pixel location as its
underlying numeric value. It is roughly equivalent to
JavaScript's console.log, allowing the developer to
dump a set of values to a visible location for inspection.

It is most effectively used by placing it (and a break-point)
immediately after a kernel run. Remember to comment
it out when measuring speed.
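
If you want the same kind of inspection without drawing anything, a text-mode cousin is easy to sketch. `formatGrid` is a made-up name, not part of the demonstration, and it assumes the same row-by-row RGBA byte layout that read returns.

```javascript
// Hypothetical text-mode cousin of DrawGrid: show one channel of RGBA
// memory as a grid of numbers (channel 0 = red, 1 = green, 2 = blue, 3 = alpha).
function formatGrid(bytes, width, channel = 0) {
  const height = bytes.length / 4 / width;
  const rows = [];
  for (let y = 0; y < height; y++) {
    const row = [];
    for (let x = 0; x < width; x++) {
      const value = bytes[(y * width + x) * 4 + channel];
      row.push(String(value).padStart(3, " "));
    }
    rows.push(row.join(" "));
  }
  return rows.join("\n");
}

// A 2 x 2 image with scores on the red channel.
const sample = new Uint8Array([0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]);
console.log(formatGrid(sample, 2));
```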


Actually Running

Once the helpers are in place, defining our processing
functions and calling them in the correct order is a simple
matter.

1. Send the instructions to the GPU space
2. Exchange memory with the GPU space
3. Execute the instructions

Given the amount of helper code in place, few instructions
are left to perform in the actual code portion on the CPU
side. The really interesting logic should be moved to the
GPU; all the interesting processing should reside in the
kernel definitions.

On the CPU side, we send the instructions to the GPU
(addProgram) and then repeatedly send notifications to
execute the function (run).


The General Problem 


Comparing code for similarity is a well-solved problem.
Comparing sequences of instructions while considering
minor variations sounds very similar to DNA comparisons.
In genetics, sequence alignment algorithms have been
around for a long time (Needleman-Wunsch dating to
about 1970). Comparing DNA sequences for similarity
while considering minor variations due to mutation or
cross-over is a common goal, which is the same problem
we are trying to solve. Think of this as a paternity test for
software.

In my case, I reached for a Smith-Waterman comparison.


The best way to understand an algorithm is to solve it with
pencil and paper. In this case, I spent much time with a
pencil, eraser, and pad of grid paper from the local
convenience store.

Take the names of two animals: coelecanth and pellican.
On the surface, they do not appear similar; however, closer
inspection (surprisingly) shows they do align reasonably
well:

coe.lecanth
p-ellican

Programmatically, we can find this alignment by solving a
Smith-Waterman matrix.

Not the Bioinformatics class I took, but a good slide on
the subject. Contained the example of pelican and
coelecanth. It is worth noting that the spelling of
"Pelican" is incorrect in the demonstration. This is the
present author's error, but does exemplify the
pattern-matching capabilities.
https://slideplayer.com/slide/5142106/16/images/21/Smith-Waterman-Algorithm.jpg


To understand how this alignment can be solved
mechanically, I recommend following the example given
on Wikipedia. I don't mean go read the Wikipedia page; I
mean pull out that grid paper and reproduce the example
for yourself. Solve each step and verify it against the
example. If you made a mistake, spend some time
understanding your error, and start over.

When you can solve a Smith-Waterman for yourself, you
have proof that you understand it.
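
Once you have done it by hand, a plain CPU version is handy for checking your grid. The sketch below uses the textbook recurrence with the scoring described later in this chapter (2 for a match, 0 for a mismatch, 1 point off for a skip). `smithWaterman` and its parameters are illustrative names; this is not the demonstration's code, and it only builds the score matrix, not the alignment itself.

```javascript
// Textbook Smith-Waterman scoring matrix, solved one cell at a time.
// Scoring values are assumptions taken from the GPU walkthrough:
// match = 2, mismatch = 0, skip (gap) penalty = 1.
function smithWaterman(a, b, match = 2, mismatch = 0, gap = 1) {
  // H carries one extra row and column of zeros along the north and west edges.
  const H = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  let best = 0;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const local = a[i - 1] === b[j - 1] ? match : mismatch;
      H[i][j] = Math.max(
        0,                        // a chain never drops below zero
        H[i - 1][j - 1] + local,  // north-west: extend the chain diagonally
        H[i - 1][j] - gap,        // north: skip a character
        H[i][j - 1] - gap         // west: skip a character
      );
      best = Math.max(best, H[i][j]);
    }
  }
  return { matrix: H, best };
}

const result = smithWaterman("coelecanth", "pellican");
console.log(result.matrix.map((row) => row.join(" ")).join("\n"));
```

Compare the printed grid against your pencil version, cell by cell.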


While the implementation is discussed in detail below, it is
worth having an intuitive understanding of one of the two
processes involved; trying to learn both simultaneously is
harder. If you are interested in implementing algorithms
on the GPU, already understanding the example algorithm
is useful. If you are already a master of the GPU and are
interested in implementing Smith-Watermans, this basic
example may be a good stepping stone.


Put it together 


It isn't easy. Using the GPU requires us to think in parallel,
and this requires us to think in ways that we aren't usually
used to. Things we take for granted in linear processing
aren't available to us; things we would typically avoid, we
accept for the sake of being able to use the tool.


The idea for how to do this actually came from one of my
favourite college assignments: Conway's Game of Life
(GoL). In GoL, each life-form location changes its state
based on the state of its nearest neighbours. This is usually
solved in a grid represented by a table of values on the
screen.

The key to that statement is that each automata cell has a
state resolved independently of all the others and based on
the values of its nearest neighbours. That pretty much
describes the Smith-Waterman as well. The only real
change is that Smith-Watermans only consider the
neighbouring cells in the upper-left corner (North, West,
and North-West).


These parent values need to be calculated prior to being
able to calculate values, representing one of the challenges
of parallelizing the algorithm: you cannot calculate
dependent values in parallel. The first mental
breakthrough was an animated GIF I found online that
demonstrated a diagonal parallelization of values in an SW
matrix (the original reference is lost to the recesses of
memory now).


This was compounded by the fact that I had initially
optimized my algorithm for low memory consumption. To
do this, I maximized the early release of the memory
associated with a given cell if it was not part of a chain.
One of the challenges in using a GPU was determining how
to arrange the memory so that calculations were only
performed on elements with a complete parent set.

Working diagonally allows us to maintain sufficient parent cells (green)
to calculate the number of child cells (yellow) in parallel. Processing this
on the GPU means that all of the other cells (white) will be calculated for
no reason. A reasonable trade-off.


This focus on optimization blinded me to the fact that the
cost of evaluating a matrix with a GPU is 1, regardless of
the matrix size. The original speed using CPU calculations
required that every cell be evaluated one at a time: width
* height processing time. Using the GPU and calculating
all of the elements each time felt like wasteful work, but at
some point, the realization that it was still only width +
height processing time dawned on my dummy
programmer brain.

Who cares about wasting a bunch of processing effort when
it saves that much time!?

While I'm sure there are efficiencies to be gained, they are
insignificant compared to the speed increases of just
evaluating cells needlessly until the entire matrix is
solved.


Once this situation is accepted, it becomes reasonable to
create the GPU function (or kernel) smithwaterman that
can solve for an individual 2-D matrix cell
[webgl:528-586]. An initialization routine was also
created, calculating the initial match value
[webgl:687-710].

The kernels are written as strings of GLSL, which is what lets us
compile the code, but unfortunately, it means no syntax
highlighting. It also leads to very cryptic messages about
invalid syntax. As you make changes, make them small to
ensure you can identify where a syntax error was made.

The implementation of the algorithm itself is minimal.

Initialize the Memory


Before we can start acting on values, we need to transfer
the values to the GPU. While addProgram is used to write
the code to the GPU, writing memory is performed by the
write function. We start by fetching an appropriately sized
2D array and then filling only the array's top and left portions
[webgl:735-744]. This minimizes CPU cycles by leaving the
iterative portion to the GPU.

The initial matrix score is done by comparing the extreme North and
West values. Matches get a base score of 2 while everything else is set
to 0.


Our first GPU task (initializeSpace) is to compare the
intersections of these values for matches [webgl:657-710].
For each cell, we look at the value to the extreme west and
north [webgl:760] and assign 2 points for a match or 0
points for a mismatch [webgl:707]. This score is stored on
the red channel.

It is worth noting that the GPU does its processing in
terms of fractional values (float): everything is a portion
of 1. So, while the scores are intended to be the integer
values 0 and 2, these must be consumed as some
proportion. The values are passed to the function as
0.0/255.0 and 2.0/255.0, making them easily
convertible between a UInt8 and float.


This is an important thing to remember. For the purposes
of this algorithm:

Everything on the GPU side treats the
numbers as floats, but the returned
memory is an integer.
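
As a small illustration of that boundary (the helper names `toFloat` and `toByte` are made up for this sketch and do not appear in the demonstration), moving between the two views is just a division or a multiplication by 255:

```javascript
// Byte (0..255) as seen on the CPU side -> fraction (0..1) as seen on the GPU side.
const toFloat = (byte) => byte / 255.0;

// Fraction (0..1) -> byte (0..255). Rounding absorbs floating point error.
const toByte = (fraction) => Math.round(fraction * 255.0);

const matchScore = toFloat(2); // what the kernel would work with
console.log(matchScore, toByte(matchScore));
```

Forgetting the rounding step is an easy way to end up with a score that is off by one after a round trip.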


Solve the Matrix 


Smith-Watermans construct a chain of values,
representing the best matches. In this case, "best" is the
neighbour with the highest running score.

The first step is to look up the nearest neighbours, which
requires us to determine how close those neighbours are.
The GPU thinks in fractional values (float), while we
think in discrete values (int). We need to calculate the
fractional size of a memory location (a pixel)
[webgl:544]. Once this is done, we can look up the value of
the current cell (here) and its nearest neighbours (nw, n, w)
[webgl:542].

Once we have identified the critical neighbours, we can
assess their values. This is done by checking all three for
the highest score and temporarily storing it on the blue
channel [webgl:562].

Looking at the nearest neighbours (yellow), we can determine the
direction of the match. In this case, horizontal (127) is the best match.
Values are stored as fractions of 255 (255/2 → 127).
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Knowing which score was highest allows us to determine 
which direction forms the chain; each direction is tested to
see if it forms the desired chain. In the event of a tied
score, diagonal matches should be favoured because
horizontal and vertical matches represent a skip;
horizontal and vertical ties can be resolved arbitrarily.
Directionality is represented by enumerations of 1-north,
2-west, and 3-northwest and is stored on the blue
channel [webgl:568].

Knowing the direction of the chain, we can now tally up
the running total. The running score of the chain is added
to the local matching score; logically, this is done after the
direction is recorded; however, due to only having 4
memory locations per cell, it is pulled from the temporary
value stored on the blue channel before we finalize
direction [webgl:565]. We also apply a skip penalty (-1
point) to chains that had to perform a skip operation
(non-diagonal direction) [webgl:579].
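
The decision made for a single cell can be sketched on the CPU as follows. This is plain JavaScript rather than the GLSL kernel, `resolveCell` is a hypothetical name, and clamping the score at zero is an assumption borrowed from textbook Smith-Waterman rather than something stated above.

```javascript
// Direction enumerations as described in the text.
const NORTH = 1, WEST = 2, NORTHWEST = 3;

// Given the running scores of the three parents and this cell's local
// matching score, pick the chain direction and the new running score.
function resolveCell(nw, n, w, local, skipPenalty = 1) {
  const best = Math.max(nw, n, w);
  // Ties go to the diagonal; a north-vs-west tie is broken arbitrarily (north).
  let direction = NORTHWEST;
  if (nw < best) direction = n === best ? NORTH : WEST;
  // Only non-diagonal moves (skips) are penalized.
  const penalty = direction === NORTHWEST ? 0 : skipPenalty;
  // Assumption: scores are clamped at zero, as in textbook Smith-Waterman.
  const score = Math.max(0, best + local - penalty);
  return { direction, score };
}

console.log(resolveCell(4, 2, 2, 2)); // { direction: 3, score: 6 }
```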


Having identified the parent, we can add the running score to the local
score and apply any skip penalties. In this case (2425), which then gets
stored on the current alpha channel.
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Once the calculation is complete, the values are stored 
permanently to the current cell in the alpha channel 
[webgl:584]. It is worth reviewing that there are 4 memory
locations per cell and how we have allocated them:

• red: local matching score
• green: unused (reserved for future use)
• blue: chain direction
• alpha: chain score


This diagram shows the calculation wave as it moves across the matrix.
Green values represent cells with sufficient information to solve, yellow
values represent values that have settled into their final state, and white
values are ones that are indeterminate. Every cell is calculated on every
cycle. Using this pattern, we can reduce the number of required cycles
from x*y to x+y.


As noted earlier, it is not sufficient to execute this process
once. While we are calculating every cell's chain score,
there is insufficient information for the last cell to
complete its chain calculation until its neighbours have
completed their calculation. To resolve this, we run the
GPU processing multiple times, first calculating the
worst-case number of cycles required [webgl:518] and
then sending processing signals to the GPU in a loop
[webgl:752]. Each iteration of GPU processing moves the
wave of completed calculations forward one step.
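
The wave can be imitated on the CPU to convince yourself that it settles. In this sketch every "cycle" recomputes every cell using only the previous buffer, which mimics how a GPU pass would behave, and the loop runs width + height times. It uses the textbook recurrence (match 2, mismatch 0, skip penalty 1) rather than the channel layout above; `cycle`, `solveByWave`, and `localScore` are names invented for the illustration.

```javascript
// Local matching score for cell (i, j): 2 for a match, 0 otherwise.
function localScore(a, b, i, j) {
  return a[i - 1] === b[j - 1] ? 2 : 0;
}

// One "GPU cycle": every cell is recomputed from the previous buffer only,
// so no cell can see a value written during the same cycle.
function cycle(prev, a, b) {
  const next = prev.map((row) => row.slice());
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      next[i][j] = Math.max(
        0,
        prev[i - 1][j - 1] + localScore(a, b, i, j),
        prev[i - 1][j] - 1,
        prev[i][j - 1] - 1
      );
    }
  }
  return next;
}

function solveByWave(a, b) {
  let buffer = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  // Worst case: the wave has to travel from the top-left to the bottom-right.
  const cycles = a.length + b.length;
  for (let c = 0; c < cycles; c++) {
    buffer = cycle(buffer, a, b); // output of one cycle is input to the next
  }
  return buffer;
}

const settled = solveByWave("coelecanth", "pellican");
console.log(settled[settled.length - 1].join(" "));
```

On the CPU this does far more work than the one-cell-at-a-time version; the payoff only appears when each cycle costs a single unit of time.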


Once this loop is complete and the tip of this wave reaches
the bottom-right of our matrix, the first phase of the
calculation is complete. Most importantly, it was
completed in x+y cycles rather than x*y cycles. While the
test samples are too small to take accurate readings (1ms
resolution in most browsers), the animals sample went
from .oms to under a millisecond. In contrast, the lorem
sample went from 5.5 seconds to about 1 millisecond; no
initial readings were taken for identical, longchain, or
gilbertsullivan [webgl:477].

Even accounting for significant measurement errors, this
is a considerable improvement!


We can see a significant improvement in speed in just the first 30
seconds of processing.
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Conclusion 


Some of this feels like a dirty hack; for example, working
around all the references to RGBA feels weird. WebGL was
designed for graphics, not computation. For the
adventurous, this lends a sense of excitement and
challenge.

This takes me back to my (very brief) days of working in
C, where memory manipulation is a little closer at hand.
Conforming to vec4 memory (rgba) really encourages you
to think of new ways of using (and abusing) the way you
use memory, reusing memory, or squeezing that extra bit
of information into an incompletely used byte (most of
which has been refactored out of the example).

While I have loved Conway's Game of Life for decades, I
never thought I would find a practical purpose for cellular
automatons. Having noticed the similarity between this
problem and GoL, I now want to revisit implementations
of aerosolized particulate dispersion models.


Lastly, the most educational part for me was the value of 
pencil and grid paper. A lot of debugging revolved around 
solving the grid with a pencil and comparing the resulting 
manually solved matrix to the program's solution.


Solving Smith-Waterman's by hand is also like doing a 
giant Sudoku or Crossword... kind of fun. 


Next Steps

1. Open in the browser
2. Hit F12
3. Insert a break-point
4. Start stepping through the code

In this case, I suggest saving a local copy first to allow you
to make minor changes to see the effect.

smoke.js is an example of a dispersion model.
https://fomelli.ug/smokejs/


The performance gains I saw by implementing the
algorithm on the GPU were significant: approximately
5000 times in terms of speed. However, the more I ponder
the problem, the more ways I see to improve it.

On the other hand...

This tool was written as a personal utility to meet
individual needs. Until there is more interest in MISS,
there is likely little reason to implement performance
gains.

It's fast enough... for now.
Looping

I suspect my calling of run in a loop is inefficient.
Implementing the loop as part of the GPU code would
likely be better. However, there are two reasons I did not
do this:

1. The current implementation allows for periodic
data reading and progress bars in the visualization
(orange connectors).

2. Input and Output memory is declared before
execution. In the helper functions, input and
output memory are swapped after every run. I don't
know how to do this within the context of a single
execution.

Neither of these issues seems insurmountable.

Measure of Similarity of Software of Students: An in-browser
code comparison utility
https://jefferey-cave.gitlab.io/miss/
Chain Resolution

Smith-Waterman calls for two phases:

1. Build the chains
2. Resolve the chains

This article intentionally ignores the second phase of a
Smith-Waterman, the code that reads the chains back off
the memory. The current implementation in MISS uses
pure CPU JavaScript to resolve the chains. However, a
recent discussion in the office made me rethink how that
was done and inspired me to rewrite the chain resolution
function to use the GPU more.

I encourage readers to look at the GPU function chain
[webgl.html:560-695] to see its implementation. It uses
very similar techniques to those already discussed.


Memory Consumption 


Building a 2-D matrix in memory means height times
width. That is going to grow quickly, depending on your
input. Also, as the number of tokens grows, there is a risk
that the numbers representing them will exceed 65535 (2
bytes).

One of the initial validation tests was to compare the
genomes of E.Coli and Y.Pestis (available in the

eats up memory, and an 8TB array exceeds my laptop's
capabilities.


I would love an implementation that cuts the giant matrix
into a series of smaller "tiles" (16000x16000 for about
2GB?). This would ensure they never consume more
memory than is available and never generate a token
identifier greater than 2 bytes.

Tokens could be mapped to an index not exceeding UInt16
for the given tile. The tiles could be solved independently
(in parallel if you have more than one GPU), storing only
their internal chains and edges. The edges could then be
stitched together during chain-resolution.
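
The bookkeeping for such tiles might start out like the sketch below. Nothing like this exists in the demonstration; `tiles` and `tileBytes` are invented names, the 4 bytes per cell assumes the rgba layout described earlier, and the hard parts (token remapping and stitching the edges) are not attempted.

```javascript
// Cut a width x height matrix into tiles no larger than tileSize on a side.
// Yields the position and size of each tile, row by row.
function* tiles(width, height, tileSize = 16000) {
  for (let top = 0; top < height; top += tileSize) {
    for (let left = 0; left < width; left += tileSize) {
      yield {
        left,
        top,
        width: Math.min(tileSize, width - left),   // edge tiles may be narrower
        height: Math.min(tileSize, height - top),  // or shorter
      };
    }
  }
}

// Memory for one copy of a tile, assuming 4 bytes (rgba) per cell.
const tileBytes = (tile, bytesPerCell = 4) =>
  tile.width * tile.height * bytesPerCell;

const plan = [...tiles(40000, 20000)];
console.log(plan.length, tileBytes(plan[0]));
```

Since input and output memory are swapped on every run, each tile would need two such buffers at a time.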


It's an interesting idea, and I'd love to see someone run 
with it 


Further Reading 


Unfortunately, most of this was done as a personal project
almost 2 years ago, so many of the references and tutorials
used have been lost to the mists of time (mists of time is
about 20 minutes, in my case).

National Library of Medicine, Escherichia coli LF82
chromosome, GenBank: CU651637.1
https://www.ncbi.nlm.nih.gov/nuccore/CU651637.1?report=fasta

National Library of Medicine, Yersinia pestis KIM10+,
GenBank: AE009952.1
https://www.ncbi.nlm.nih.gov/nuccore/AE009952.1?report=fasta

The Helper Functions

If you want to expand on this, I recommend investigating
my helper functions. There is a lot of "stuff" going on.


Smith-Waterman

Unfortunately (or fortunately), my efforts to port and my
lack of understanding resulted in a mash of
non-functional code. The efforts to debug created my
understanding but resulted in something that looked
nothing like what was started with.

GitHub: Checksims

Much of my learning comes from simply stepping through
the code samples given on Wikipedia and working through
my own examples.

Wikipedia: Smith-Waterman

Wikipedia: Conway's Game of Life

Understanding the algorithms came by writing code in
PoJS (Plain old JavaScript) without adding the complexity
of a GPU. Now that I know GPUs better, I think it would
have been the more straightforward solution... hindsight is
20/20, so who knows?


WebGL

WebGL Fundamentals: was the primary set
of tutorials followed to figure out how to
do things with WebGL.

Mozilla Developer Network: This is the
de facto standard reference for
browser-related things and includes both
tutorials and generic references for WebGL.

Libraries

WebGL is relatively new, and WebCL (Web
Computing Language) is still a work in progress.
Fortunately, several libraries have been developed to make
WebGL more computationally friendly.

tensorflow.js: Google's famous library for
machine learning... implemented in
JavaScript

TWGL: psGPU was meant to turn into
what TWGL is. If I were to solve this
problem again, I would use this library.

WebCL Overview, Khronos Group
https://www.khronos.org/webcl/
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Avoiding Psychic 
Software Development 


A tribute to James Randi for dummy
programmers

Inspired by Randi's skepticism, let us use critical
thinking in software development.

Skepticism towards claims of task completion in
software development and promises of quick fixes
are emphasized.

Popoff's exploitation of faith and the ADE-651's
ineffective technology showcase the dangers of
blind belief.


Randi's legacy prompts reflection on the
importance of evidence-based decision-making
and skepticism in the face of extraordinary claims.

The Amazing Randi was a successful stage magician,
famously surpassing many of Harry Houdini's
achievements. Later in his career, James Randi took his
mastery of magic and used it to turn a critical and skeptical
eye to claims of paranormal powers. Claims of divine
healing powers, telekinesis, and psychic powers were all
put to the test with what grew to a $1,000,000 reward to
anyone who could prove their powers.

The Randi Prize was never claimed during its 50 years.

Modern Product
Development

é Bs 


[Edward Deming once said, 


In God we trust, all others must bring data, 


‘Over my career Ihave become infamous for citing Deming, 
but if this is atime for confessions, I must confess that my 
knowledge of Deming came much later than my attitudes 
toward evidence based decision making. Rather, Ican 
attribute my personal distrust of claims and constant 
demands for evidence to people like James Randi showing 
‘me how, and when, to be critical of my own beliefs. 


Both Deming and Randi demand that we approach our 
observations critically and skeptically manage our own 
biases. 


* James Randi Educational Foundation
https://web.randi.org/home/jref-status


This has impacted me professionally in two significant
ways:


1. I am suspicious of software developers that claim
they have finished a difficult task


2. I am suspicious of consultants promising to make
problems go away quickly and cheaply


I say this as someone who has both development and 
consulting in his past. 


A Personal Confession 


In my youth, I was fascinated by psychic powers. I bought all
the books on remote viewing, developing my psychic
powers, and becoming a medium. I recognised that
information was a powerful tool and was interested in any
means to acquire more of it.


The problem is that, like Fox Mulder (X-Files), I've always
wanted to believe. Llewellyn Publishing can account for
much of my allowance.


Unfortunately for my desire to believe, at some point I saw
the now classic episode of That's My Line with guests
James Hydrick and James Randi.* In the show, Randi is
skeptical and flat out states that Hydrick's telekinetic ability
to turn book pages amounts to him blowing on the pages,
and introduces some lightweight Styrofoam around the
book. He is firm and calm, and completely unrelenting in
his stance, and the scene eventually gets uncomfortably
awkward as Hydrick begins a convoluted explanation of his
ensuing failures.


* Classic Hoax: Psychic James Hydrick
https://wafflesatnoon.com/james-hydrick-psychic/


Hydrick is a bit of a tragic character. While he was obviously
a fraud, it appears he became so as an attention-seeking
behaviour. Watching his confessional interviews after the
Randi event, I got the impression that he had come to
believe his own hype; that he had misled even himself.


Whatever the case, Randi's critical approach to assessing
paranormal capabilities has haunted my ability to blindly
believe with a shadow of skepticism.


Software Developers 


In the process of developing software, developers are
expected to produce solutions to problems that have not
been solved. That is the nature of the craft. However, there
are corporate expectations that the problem be solved in a
reasonable amount of time, because time is money. This
places pressure on developers to be finished, and this
creates the risk space: scared for our jobs, eager to please,
wanting to appear skilled, we want it to be done too.


So we tell our managers that the job is complete.
We are physically incapable of seeing that it does not
completely solve the problem, or that it is too difficult to
use in its current state. Like those that are healed by Faith,
we want it to be true.


This is where processes and philosophies around the SDLC
come into play.


• Issue boards keep us from reporting more progress
than we have actually achieved


• Sprints and backlogs keep us focused on the most
pressing issues


• Automated software build/test/deploy (CI/CD)
ensures that evaluation is unbiased and
reproducible.


While the debates continue about which controls are the
best, there is no doubt that we need the controls. Like
James Randi placing Styrofoam around a phone book,
these controls ensure we are being honest... even to
ourselves.
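

To make the idea of an unbiased control concrete, here is a
minimal Python sketch of a "done" gate. It is an illustration
only, not a description of any particular CI/CD product: the
names Task, verified_done, and add are hypothetical. The point
it shows is that the developer's claim is recorded but never
consulted; only the checks decide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """A unit of work plus the checks that define 'done' for it."""
    name: str
    # Each check is a zero-argument callable returning True when it passes.
    acceptance_checks: List[Callable[[], bool]] = field(default_factory=list)
    claimed_done: bool = False  # what the developer reports; ignored below


def verified_done(task: Task) -> bool:
    """A task counts as done only if it has checks and every check passes."""
    if not task.acceptance_checks:
        return False  # no evidence is not the same as success
    return all(check() for check in task.acceptance_checks)


def add(a: int, b: int) -> int:
    """Hypothetical feature under test."""
    return a + b


# A task that is claimed done and also backed by passing checks.
example = Task(
    name="implement add()",
    acceptance_checks=[lambda: add(2, 3) == 5, lambda: add(-1, 1) == 0],
    claimed_done=True,
)
```

Calling verified_done(example) returns True here, while a task
with claimed_done=True and no checks at all is reported as not
done. That is the Styrofoam around the phone book.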


Faith Healing 


Even assuming it had been real, Hydrick's ability to turn
phone book pages was little more than a novelty act. It
may have sold a few books, but debunking it was not
exactly an earth-shattering revelation. A more significant
case can be found in that of Peter Popoff.


Peter Popoff is an evangelical minister whose television
broadcast became famous for his claims of divine
knowledge and healing abilities. During his shows he
would call arbitrary individuals from the audience by name
and cite details of their life with no prior knowledge. As he
approached them, and they began to stand, he would tell
the gathered audience what terrible diseases the person
had and that he would heal them. Both the knowledge and
the healing were claimed to be directly imparted divine
powers.


Rather than being divinely inspired, Randi discovered that
Popoff's wife was the true source of knowledge. Prior to
the show, she would gather information from attendees
and broadcast their names and information via
radio to an ear-piece he wore.


Here lies a case of fraud that demonstrates true harm.


People that sought out Popoff truly believed God spoke to
him, and that his hands could remove illness from them.
Believing that they had been cured meant that they would
stop seeking treatment for arthritis, or epilepsy, or heart
conditions. When one adds the $4 million a year in
donations people made, Popoff has made it a little more
difficult to replace the pills he told them to throw out.


James Randi did thousands of people a service when he
exposed Popoff. Though we do need to ask why Popoff
continues his healing to this day.


Consultants and Vendors 


One way management can reduce costs is to hire outside
expert consultants that understand the problem better
than the internal staff, and nobody knows the solutions
better than the vendors. Naturally, as experts they are to
be paid more than internal staff; this is justified by their
being more knowledgeable.


Unfortunately, this is often not the case. 


Due to the short time frame they are present for,
consultants are not actually paid for measurable results.
The measurable results of their suggestions and changes
come after they have left. Their real rewards are tied to
making the manager that hired them feel good about their
decision. This does not necessarily mean they were
successful at solving the problem.


Like the Faith Healer, consultants can reap huge rewards
for making promises of solving problems, and making
their audience feel that the problem has been solved
through some special conference from an authority.
Unfortunately, like the Faith Healer, this can be, and often
is, done as an act of faith.


In fact, it is almost impossible for this to be undone
because the person that has paid out their life savings to be
healed, or the manager that has spent a significant portion
of their budget, cannot admit to themselves that they were
swindled.


The more we pay, the more we want to
believe.


I have worked with some great consultants over the years,
but I've spent more working hours with bad ones. The
reinforcement cycle is one in which the best rewards go to
those who make management feel good. Unfortunately,
they are also the ones that consume the most time in
fixing and retrofitting good solutions around their popular
one. Given they are paid by the hour, this means the
feedback mechanism benefits the dishonest.


There is no easy answer to this except to be skeptical of
smooth-talking salesmen that echo what you already want
to believe. Often consultants are brought in because local
staff have been asked to solve a problem and have given an
undesirable response. Unfortunately, that is often the
honest, but hard-to-hear, truth.


Like James Randi upsetting a lot of Peter Popoff's
believers, the truth can be hard to hear, but healthier for
you in the long run.


Faith Healing in Modern
Times


In reminiscing about the impact of James Randi, naturally
I turned to Wikipedia to refresh my memory. It has been a
long time since I have had need to know about Randi's
work: a different time. We now rely on scientific
reasoning, and no longer believe in psychics and faith
healing. We no longer make business plans based on gut
feelings but rather collect data to provide a basis for


That was a different time, a simpler time.


Imagine my shock to learn that frauds Randi has exposed
continue to be active as recently as 2015, with terrifying
consequences.


According to Wikipedia, the ADE 651 is an explosive
detection device that is used internationally to keep people
safe from terrorism. Naturally, people want to be safe from
terrorist bombs and put their faith in technological devices
to protect them. Randi first challenged the developers of
the device in 2008, and it has since been demonstrated to
be ineffective, to the point of containing no operating
machinery at all. The FBI has repeatedly issued bulletins
to law enforcement to stop using the device. In spite of this,
it continues to be used as a life-saving device by several
countries' local law enforcement agencies.*


* Data are not taken for museum purposes, The Deming Institute
https://deming.org/data-are-not-taken-for-museum-purposes-they-are-taken-as-a-basis-for-doing-something/

* SHC Dismisses Petition for Ban on 'Fake Bomb Detectors'
https://propakistani.pk/2019/12/13/shc-dismisses-petition-seeking-ban-on-fake-bomb-detectors-in-pakistan


People are so desperate for it to be true, they just will not
let it go, and people are dying as a result.


The false sense of security provided by the device
had catastrophic effects for many Iraqi people,
hundreds of whom were killed in bombings that the
ADE 651 failed to prevent.


— Wikipedia: ADE-651, Investigations, Iraq 


Perhaps people cling to it for hope, perhaps they cling to it 
for vanity, but in these modern times, people are paying 
millions of dollars for devices that end up getting them 
killed. 


Reading about the ADE 651, I am reminded that there is no
quick cure for superstition. We still want to believe that we
are finished, and still want to believe that we are clever,
and we are still greedy when it comes to getting that
promotion. Like Randi, all we can do is be eternally
vigilant against our own fears, hopes, and biases.


Thank-you Mr. Randi 


Randi has left a swath of fraudsters in his path: Uri Geller,
James Hydrick, Peter Popoff, James McCormick and many
others. Each of them represents a swindler filling their
pockets with millions by feeding on the hopes and fears of
thousands of people. He showed the danger of blind faith,
and the importance of protecting ourselves from our own
desires.


From time to time I still blame software errors on
planetary alignment, or demonic possession. Other times I
amaze people with my psychic ability to know an error
without ever having to have seen the problem they are
experiencing. But these are done in jest, and always
followed by (at least the offer of) a detailed investigation
or explanation as to how the discovery was made.


As individuals with a responsibility to achieve goals, and
under pressure to deliver, it is sometimes hard to hold
ourselves to account. Sometimes we feel tempted to give or
accept false hope to preserve our own dignity. Randi's
approach to debunking the paranormal did not make him
friends with believers, but cut directly to the heart of the
matter. When lives* and livelihoods* are on the line,
that's what really counts.*


So thank-you, James Randi. You did not make friends
among the frauds and charlatans of the world, but you
certainly inspired at least one developer to push beyond
the illusion of success.


* Boeing whistleblower alleges systemic problems with
737 MAX, The Seattle Times, 2020-06-18
https://www.seattletimes.com/business/boeing-aerospace/boeing-whistleblower-alleges-systemic-problems-with-737-max/

* Twitter breach exposes one of tech's biggest threats: Its
own employees, NBC News, 2020-07-16
https://www.nbcnews.com/tech/security/twitter-breach-exposes-one-tech-s-biggest-threats-its-own-n1234076

* Breach at software provider to local governments,
schools, ABC News, 2020-09-23
https://abcnews.go.com/Technology/wireStory/data-breach-software-provider-local-governments-73209257

* HSBC suffers IT outage, Information Age, 2017-02-27
https://www.information-age.com/hsbc-suffers-it-outag
4543/


Allow people to make assumptions and they will come
away absolutely convinced that assumption was correct,
and that it represents fact. It's not necessarily so.


Further Reading 


This has been a personal tribute to a great man and some
of the things he inspired me to think about, and the way he
caused me to see the world. Naturally, as I was writing
this, I came across some articles on perception and how, as
humans, we want to be deceived.


New York Times: Sleights of Mind
A description of how magic is based in
cognitive perception


Skeptic News
Randi's recommended daily reading of
skeptic news sources that approach the
world with a critical and scientific eye


If you are impressed with James Randi, you should also
learn about


Margaret Hamilton
who put quality control at the forefront of
her team's software design, saving the
Apollo Moon landing; but whose design of
an error-free programming language is
largely forgotten


W. Edwards Deming
the father of data-driven decision making


and in the interest of being skeptical of Randi, and
because I still want to believe


... Challenge
an article critical of Randi's requirements for the prize,
which indicates he may have used it as a vessel for
suppressing legitimate evidence.


The Unbelievable Skepticism of James
Randi
A slightly more critical look at Randi's life's work. Raises
questions about Randi's personal bias and profit motive.


Git and the Intermittent
Network


• Personal experience with network failure


• Benefits and risks associated with modern digital
platforms; emphasis on availability, sustainability,
and growth capability


• Reflection on the historical development of the
Internet.


• Evolution of version control systems like Git and
SVN to enable offline work


Within my organization, we have been moving toward
modern web-based platforms. These offer many benefits
to our users regarding availability, sustainability, and
growth capability, and I have been one of their leading
proponents.


While modern web-based services are the norm and
desirable, their risks should be considered for mitigation.
These risks revolve around the centralisation of service
and the network availability of clients. These risks are well
understood, and most tools used by modern development
teams were designed with these types of issues in mind.
However, as the Internet becomes more pervasive and
stable, we commonly lose sight of its limitations.


A Personal Experience 


As I write this, I am experiencing an internet outage.


This event occurred mid-meeting and has resulted in a
situation where I cannot connect to the online resources
required to complete my corporate objectives:


1. No connection to a production server to conduct
repairs on that server (or even inspect the logs)


2. No connection to Microsoft Azure to conduct
experimental work in our laboratory environment


3. All forms of meeting communications have been
cut off (MS Teams, Webex, VOIP telephone)


4. An Outlook plugin cannot connect to an encryption
server and is frozen because it is attempting to
show me an encrypted email


5. I can't take any corporate training I'm supposed to
do or read that manual for that new tool I'm
investigating.


I'm completely dead in the water.


According to my provider, a fibre-optic line has been cut
somewhere between Halifax and Montreal resulting in
massive connectivity loss for the region. The only
productive task left to me is to write up an assessment of
the current failure, on my local device, and upload it to the
network when communication is re-established.


Hold on... Did I just describe getting work done and loading
it later? That is a well-known caching strategy for
resolving network latency issues.


Historical Note 


The Internet, as we understand it, has not always been as
accessible, available, or reliable as we have come to expect.


It is worth remembering that the Internet was initially
designed as a distributed communication tool to allow the
military to continue operating remote computers in the
event of massive node loss (dating back to 1966). The
assumption of loss of network availability has been an
underlying assumption of much of the Internet's growth
and is built into its fabric.


In the early days of general access to internet services,
connections to the network were made intermittently.
This was performed by dial-up connections, which would
be initiated for short periods.


As networks became more common and robust, much of
the shared development of software (open source) began
to be shared across the network, as opposed to letter
carriers and print (via the Share catalogue). Unfortunately,
internationally, not all network connections are created
equally, and some users suffered from several disruptions
to connectivity.


What was Share?


In the mid-1950s, a user organisation for scientific
applications ... was formed. One of its most important
functions was serving as a clearinghouse for contributed
software subroutines. The organisation was called Share
... and the contributed routines became the first library
of reusable software.


— Robert L. Glass, Facts and Fallacies of Software 
Engineering 


Out of this sense of sharing evolved several clearing
houses such as SourceForge, Tigris, and eventually GitLab
and GitHub. Unfortunately, even these clearing houses
were subject to disruption.


... VCS (Version Control System) solutions (such as Git)
evolved in this ... distributed node.


In recent years, CAP theorem has evolved to explain that
high availability comes at certain costs, which, once the
costs are accepted, can offer the benefits seen by the BBC
newspaper during the Russo-Georgian conflict. During
this period, communication lines were severed, meaning
that correspondents and readers could not communicate
across national boundaries. Service continued to be
delivered to each side of the boundary, allowing reporters
to continue reporting and commenters to continue to offer
feedback during the entire conflict. Automated
synchronization of news reports and on-the-ground
reader comments occurred when alternate communication
paths were established.
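

To illustrate the kind of reconciliation described above,
here is a minimal Python sketch. It is not a description of
how the BBC's system actually worked; it only assumes that
each side of a severed link keeps accepting comments into its
own log, that every comment carries a globally unique id, and
that the logs are merged once a path is restored. The names
Comment and merge_logs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Comment:
    comment_id: str   # assumed to be globally unique, e.g. "<site>-<counter>"
    timestamp: float  # seconds since the epoch, recorded by the accepting site
    text: str


def merge_logs(*logs: Iterable[Comment]) -> List[Comment]:
    """Union the logs by comment_id, then order by (timestamp, comment_id).

    Entries that both sides already held before the partition collapse
    into one, and the result does not depend on the order of the logs.
    """
    merged: Dict[str, Comment] = {}
    for log in logs:
        for comment in log:
            merged.setdefault(comment.comment_id, comment)
    return sorted(merged.values(), key=lambda c: (c.timestamp, c.comment_id))
```

The design choice is the usual availability trade-off: each
side stays writable during the outage, and the cost is that
readers see an incomplete conversation until the merge runs.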


Each of these historic scenarios has common elements:


• contributors are forced to disconnect from
communication and wait,


• contributors wish to continue to prepare their
communications


• caching is used to overcome communication
latency, allowing people to continue working
locally until the connection is reestablished.


This batching, or caching, can mitigate connectivity issues
with web development platforms.
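

The batching idea can be sketched in a few lines of Python.
This is an illustrative store-and-forward buffer under my own
assumptions, not the design of any specific tool: Outbox and
FlakyServer are hypothetical names, and the upload function is
assumed to raise ConnectionError while the network is down.

```python
from collections import deque
from typing import Callable, Deque, List


class Outbox:
    """Store-and-forward buffer: save work locally, send when the link is up."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                    # hypothetical upload function
        self._pending: Deque[str] = deque()  # work waiting for a connection

    def submit(self, item: str) -> None:
        """Always succeeds locally, whether or not the network is available."""
        self._pending.append(item)

    def flush(self) -> int:
        """Send queued items in order; stop at the first connection failure.

        Returns the number of items sent. Unsent items stay queued.
        """
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                        # still offline: try again later
            self._pending.popleft()          # drop only after a successful send
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


class FlakyServer:
    """Stand-in for a remote service that may be unreachable."""

    def __init__(self) -> None:
        self.online = False
        self.received: List[str] = []

    def upload(self, item: str) -> None:
        if not self.online:
            raise ConnectionError("network unavailable")
        self.received.append(item)
```

With a FlakyServer that starts offline, submit() keeps
accepting work and flush() sends nothing; once online is set
to True, the next flush() delivers everything in the original
order.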


Web Based Publishing 


Regardless of what is published, content development has many
common elements throughout its progression. Whether
this is the dynamic content of software or the static
content of news videos, there is a common process for
creating and distributing the content online.


To use the example of an individual publishing an article
to their newspaper (maybe their blog), they (the
Contributor) would connect to the internet
(NetProviderA) and type their article into Open Journal
Systems, WordPress, or Medium (the Server). They can
continue typing into the software on the Server, perhaps
running spell and grammar checks until they hit the
publish button. At this point, the consumer can retrieve
the message whenever they want.


The software development process would be the same
with web-based development tools. The contributor
would connect to the Internet, edit their document on the
server, and indicate readiness to publish, making the
application available to consumers.


Intermittent Connections 


Looking at the historical development of the Internet and
the current issue, we can see a risk associated with the
network not being available. We cannot consider the
server in isolation and must include the network's effects.


If the contributor's network connection is terminated,
they cannot perform any work. Sticking to the newspaper
article example, the author may have an excellent idea in
mind, know of a flaw in the argument, or (frankly) want to
get some work done toward the publishing deadline;
unfortunately, they are stopped.


Another layer that can be considered to overcome this
issue is the local computer, which can be used to cache
work: the contributor can type their document on their
local computer and save it to their local disk.


Looking at the previously discussed internet history, we
can look to VCS tools to assist us in solving this problem.
SVN and Git (as well as their predecessors and
competitors) were developed in an environment where
work needed to be buffered against future connections.
Specifically, work needs to be performed and stored locally
until it is possible to transmit it.


This has been an ongoing evolution, and Linus Torvalds
specifically developed Git to resolve buffering issues he
saw in SVN.


Tip


Git is not a simple upgrade of SVN; the two products
have trade-offs. Git stores a complete copy of the
database on every node, while SVN keeps the full
history on a central server and only a working copy
on each node. Git stores complete copies of each
state in its database, while SVN stores a sequence
of state changes.


These differences make Git always recoverable (any
single node can rebuild the entire system). Still, SVN
requires less storage space and works better when
large binary files are involved.


While Git is now dominant, many users of large
binary files (engineering diagrams and cartography)
prefer SVN.


Local or Web 


Using local development tools and synchronizing periodic
changes is a common practice that allows us to
communicate only the changes we are committed to.
This practice also offers the benefit of resolving latency
issues. During 2020, lockdowns resulted in many of
us having to work from home, remote from our usual
workspace. We are using networks
established for scenarios that demand significantly less
resilience (binge-watching movies) in situations that
require significant resilience (earning income to pay for
groceries). This can be surprising for those of us whose
livelihoods have become tied to these networks for the
first time. In cases where the Internet has temporarily
failed, and we are left unable to progress, it can be
distressing to our managers and ourselves.


This does not mean that local tools are better than
web-based tools.


For many years, my favourite platform was Cloud9, an
online web-based IDE that allowed workstations to be set
up on demand. This allowed me to maintain several
development environments that met various needs. The
ability to pick up work from anywhere in the world allowed
me to continue working on projects from a hotel courtyard
in Ecuador, from an old, indestructible RCA Cambio. The
ability of the vendor to supply me with a powerful remote
computer meant I could work from a $100 computer. This
meant I received software upgrades immediately and
could work from any cheap hardware I could scrounge up.


There are trade-offs to consider, and that is what this has
been about. Be aware of the trade-offs before wholly
committing to one solution or the other. IDE vendors want
you to be tied to their tool, and this reduces many of the
benefits of distributed VCS platforms. On the other hand,
we have computing networks; take advantage of them.


Ultimately, I recommend a balanced approach that takes
lessons from the Internet's rich history of information
sharing.


Use Web-based IDEs, but use generic ones. Do not depend
on always having access to the vendor's editor. Instead,
maintain regular local pulls from your VCS repository and
use programming languages and data formats based on
simple text. This allows you to switch to a local copy
during network outages and protects you from vendor
lock-in.
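

One way to keep those local copies fresh is a small script
run on a schedule. The following Python sketch is only an
illustration under stated assumptions: it assumes Git is the
VCS, that the repositories are already cloned locally, and
that a fast-forward-only pull is acceptable. The function
names and the injectable run parameter are hypothetical; only
the git command itself (git -C <path> pull --ff-only) is real.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Sequence

# A runner takes a command and returns its exit code (0 means success).
Runner = Callable[[Sequence[str]], int]


def run_command(command: Sequence[str]) -> int:
    """Run a command with subprocess and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def pull_command(repo: Path) -> List[str]:
    """Build a fast-forward-only pull for the repository at `repo`."""
    return ["git", "-C", str(repo), "pull", "--ff-only"]


def refresh_local_copies(repos: Iterable[Path], run: Runner = run_command) -> List[Path]:
    """Pull every repository; return the ones that could not be refreshed."""
    failed: List[Path] = []
    for repo in repos:
        if run(pull_command(repo)) != 0:
            failed.append(repo)  # e.g. offline, or local history has diverged
    return failed
```

The --ff-only flag is a deliberately cautious choice: the pull
refuses to merge when local and remote history have diverged,
so an unattended job never rewrites your working copy in a way
you did not ask for.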


Further Reading 


As ever, Wikipedia has become the place to start. I
recommend reading the article on Version Control
Systems.


There are several generic, web-based IDEs that I have
enjoyed using:


• Theia (Eclipse Foundation)
• Eclipse Che (Eclipse Foundation)
• Cloud9 (Amazon)
• GitPod (Git Pod)


Interestingly, each can be served on your corporate
network (to protect your institution's intellectual
property) or installed on your local computer, allowing
you to continue working when your network gets nuked,
or a ship's anchor snags your data cable.


Somebody, please take
my money


The business of software. A guide for dummy
programmers


• A major retailer could not find a way to make a sale
despite a willing customer.


• This highlights systemic issues in the software and
information system industry.


• Historic examples and modern methodologies of
quality assurance in digital systems


• Programmers, managers, and executives must
prioritize functionality and quality over superficial
features.


I spent 6 hours today trying to buy a printer ... and failed.


• There is money set aside for this purchase in my
bank account.


• There are printers on the shelves
• Websites advertise printers for sale


I could not purchase a printer because a corporation with a
near monopoly on the Canadian market, one of Canada's
largest privately owned retailers, could not tell me which
printers they had available for sale.


Let that sink in:


A multi-billion dollar retailer could not tell me if they
had anything to sell me.


On the surface, this is laughable and funny, but it betrays a
more insidious, pervasive, and downright dangerous
problem.


The Sequence of Events 


I recently made a long-distance move across almost the
entirety of Canada. I'll credit the movers; most of my stuff
made it across in one piece; unfortunately, my printer
suffered catastrophic snapping of some hinges. Since we
do not deal in paper as much as we used to, my wife and I
initially just tried to live without it; in the short term, this
saved us a bunch of money, but it also helped us to
evaluate what we really needed in a printer:


1. Scanner with an auto-feeder
2. Cheaper ink cartridges
3. Duplex printing (nice to have)
4. Ink colours in separate cartridges (nice to have)


After a moment of distress trying to print a government
form, and with a reasonably clear idea of what I wanted,
my wife and I decided to end the problems and pick
something (anything) up at a nearby retailer: Staples
Canada.


Naturally, my first reaction was to look on their website:
staples.ca. Reviewing a listing of available products over a
cup of coffee in our pyjamas was exactly the way to start
figuring out what we wanted. It was a quick search:

• all available printers

• ordered by price

Jumping to the first couple of printers that met our needs
would give us a good idea of our critical price point.
Further, I also saw they had an option for 2-hour curbside
pickup, so we should be picking it up in a couple of hours.
We found a couple of printers, at around $100, that met
our needs. Within 30 minutes, we clicked to order.

• None available in-store

• None available online

What? That's annoying. Let's try the next one.


• None available in-store

• None available online


Hold up. Let's try a filter: Only show items with 2-hour
delivery. Surely, that will filter to items available in the
store right now.


• None available in-store
• None available online


The more I searched, the more frustrated I became. After
an hour, my wife and I finally decided to go into the store
and buy whatever they had available.


Upon arriving at the store, it didn't take long for us to
narrow in on the products we were looking for. They were
more expensive than their equivalents online, but they
were there.


... or were they?


We narrowed it down to one of three printers when a clerk
arrived. We pointed to them and asked a couple of
questions, and he stated: well, we should probably check
to see if they are available before we go any further. They
may not have them in stock.


Momentarily slack-jawed, we waited for him to check
whether they were in stock, only to find that they weren't.


None of the printers we had expressed an interest in were
available for purchase. They were on the shelf, but none of
them were available for sale. The clerk suggested we go
online to check availability.


As we left the store empty-handed, the manager stopped
us and asked if we had found everything we were looking
for (no) and how he could help. He spent the next two
hours explaining that it wasn't his fault they had no stock
and trying to upsell me a high-performance Laser Printer.
My only question at this point is: Is it available for
purchase?


The manager gave me a customer complaint phone
number and promised to email me a link to a printer I was
interested in. He had verified with the warehouse that it
was available, and I could get it shipped to the store. He
could not request it, but there should be no difficulty if I
ordered it online via the website.


X In-store Pick up not
available at


The results of my attempt to order a Brother MFC-497DW from
Staples.ca


On my way home, I stopped at the local hardware store to
pick up a ladder and a shovel. They didn't have any ladders
in stock but pointed me to their website.


Temporarily out of stock online
View similar in-stock items


Clicking on the similar in-stock items link on HomeHardware.ca had no
effect at all.


The Significance 
These are fundamental problems in the Software and Data
systems space: inventory systems should be able to count
inventory, and sales systems should be able to sell
products. Businesses are allowing systems to be released
without testing the requested fundamental feature, which
is a massive failure. Software developers are allowing the
release of features that do not work... and not in subtle,
nuanced ways.


It's like manufacturing a car and forgetting to attach one 
of the wheels. 


Information systems are just that: systems. They integrate
humans and information to achieve a goal. In this case, the
business itself is its inventory management processes, and
they were completely nonfunctional across multiple paths
to success.


1. No validation of the system had taken place
2. No alternatives for success were planned for


That is fairly significant, and I suspect there are two
elements at play: 


1. The development was outsourced to a consulting 
firm that is motivated to appease managers. 


2. It is highly likely that the feature was delivered on
schedule.


Software development has been commoditized, and in the
process, the managers and developers have forgotten a
fundamental truth: software is built to solve real-world
problems that real people have. If you develop tools that
don't work, people are hurt. Even if it is just because they
have to manually search through a store's catalogue or
because they have to add up numbers in a spreadsheet
before entering them into the payroll system, they are
still hurt. While it is up to developers to enact that
quality, managers must expect quality.


This means deadlines may not be met.


But then again, you have to ask: if the feature isn't
implemented, has the deadline been met?


It's Embarrassing 


Since the dawn of software development, testing and
validation of the work have been paramount. When you
entrust a person to design a process to take care of people,
it must take care of people. Human judgment is no longer
involved to cover your mistakes in process design: if your
process is flawed, the machine will carry out a flawed
process.


Margaret Hamilton realized this in the '60s when her team
developed the software that landed Apollo, introducing
the concept of Software Engineer. She did not develop
software that worked because astronauts are trained not to
make mistakes but because she tested the crap out of her
software. Later, this led to the languages 001 and USL,
which made process designers think about errors before
they happened ... because that's what programmers do.


After 22 days, Home Hardware was unable to tell me anything more than
my order is in progress. After asking me if I'd like to wait it out, we
took another 10 minutes and cancelled the order.


In the late 1990s and early 2000s, unit testing, test-driven
development, continuous integration, and automated
regression testing came into their own, offering means of
ensuring that features were guaranteed to be minimally
tested for basic user expectations prior to release.
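

As a minimal sketch of what "minimally tested for basic user expectations" could look like, the following Python test checks the two promises from the printer story: the inventory can count, and sales can sell what has been counted. The `Store` class and its methods are hypothetical stand-ins invented for this illustration; they are not any retailer's actual system.

```python
import unittest


class Store:
    """Hypothetical stand-in for an inventory + sales system."""

    def __init__(self, stock):
        self._stock = dict(stock)  # item name -> units on hand

    def in_stock(self, item):
        return self._stock.get(item, 0) > 0

    def sell(self, item):
        # Refuse the sale rather than silently oversell.
        if not self.in_stock(item):
            raise LookupError(f"{item} is not available for purchase")
        self._stock[item] -= 1
        return item


class BasicExpectations(unittest.TestCase):
    def test_item_on_the_shelf_can_be_sold(self):
        store = Store({"printer": 1})
        self.assertTrue(store.in_stock("printer"))
        self.assertEqual(store.sell("printer"), "printer")

    def test_sold_out_item_is_reported_honestly(self):
        store = Store({"printer": 1})
        store.sell("printer")
        self.assertFalse(store.in_stock("printer"))
        with self.assertRaises(LookupError):
            store.sell("printer")


# Run the two checks without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BasicExpectations)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Two tests this small would not prove a retail system correct, but either one failing would have flagged "on the shelf, not for sale" before a customer did.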


DevOps in 2010 should have made developers even more
responsible for basic functionality testing, making them 
directly answerable for the system's errors. 


The issue is likely two-fold:


1. Managers believe the skill lies in the visual
elements they can see rather than the business
processes they cannot.


2. Developers are not standing up for the engineering
act they are engaged in.


It was a very long time and a lot of hard work before
Hamilton's peers considered software a type of
engineering. Watching software be released that does not
meet the most basic functionality requirements makes me
realize that many developers need to live up to her legacy.
I'll take it one step further ... managers must stop
expecting software to have the "testing" check-box ticked
and start understanding that quality is a continuous
improvement mindset.


It Matters 


I raise this as a moment of reflection within Information
Systems and Software Engineering.


• Airplanes are falling out of the sky


• Small business owners are being falsely
accused and imprisoned


• Young mothers have their children taken
from them


• Retail organizations are physically unable
to sell their products


Programmers need to consider the direct consequences of
their actions; Managers need to take a moment and reflect
on the value they are adding to their organization; and
Executives need to reconsider the actual deliverables they
are asking for.


Turn a Shiny Dashboard 
into a Desktop App 


Because sometimes bureaucracy gets in the way 


• Learn how to deploy a Shiny dashboard as a
desktop app, sidestepping the hurdles of server
deployment and organizational red tape.


• Create a seamless user experience with a starter
script that loads the application from a shortcut,
minimizing the need for technical expertise.


• Explore how this approach extends beyond Shiny,
enabling the deployment of various web-based
applications that offer flexibility and ease of access
in diverse organizational environments.


Sample Code available on GitLab:
Jeff Cave / shinyapp-desktop


Shiny is a popular web publishing service; unfortunately,
not every application can be deployed on servers. This
tutorial demonstrates a simple means by which to deploy a
Shiny app to the desktop by creating a Site-Specific Browser.
Mostly to skip the bureaucratic begging for a server.


An R Shiny dashboard can be run as a desktop application from a
double click to give non-technical users a seamless experience


Once upon a time, one of the Data Scientists in our
organization called me with a problem. They had spent
significant time putting together a dashboard in R and
Shiny and wanted to know where they could host it to
share it with clients.


They wanted to know where our Shiny server was stored
and how they could publish to it.


It took everything I had not to laugh at them.


The thing I was working on when he called was a generic
deployment system for precisely that kind of project.
However, I was running into negotiations with Security,
Finance, and Architecture ... everybody has to have their
say. To get him the server he wanted, I estimated years.


Like any large organization, the bureaucracy must be fed. 
This was a massive blow to the Data Scientist. His team
had been developing the dashboard for months. The
business had invested precious effort in describing its
informational needs. The team had demonstrated the
value of Shiny. They were ready to realize all that effort,
and the organization's statement was: We can't do that.
That's a lot of wasted effort.


After some discussion, I took pity on him and his team
(and myself; I'd actually invested a lot of my coffee breaks
coaching his junior Data Scientists).


• Your customers need the dashboard now? (yes)
• Is the dashboard computationally expensive? (no)


• Do you have a shared folder in which you could
publish the application? (yes)


• Are the customers at all technically savvy? (no)


I told him to give me the weekend, and I'd give him a
prototype solution on Monday.


Project 


The intent is not to teach how to do complex mathematics
or write Shiny Apps but to demonstrate how to configure a
project within the organizational environment. Hopefully,
this will act as a springboard, helping users set up quickly.
The code itself is a simple demo app exported from
RStudio. The real trick is to get it to run on the desktop
environment.


The expectation is that the developer wants to deploy a
shared application but is in an environment with no Shiny
server, and there is not likely to be one anytime soon.
Rather than wait, the developer can take advantage of a
shared folder structure and an old feature of Firefox to run
the application on individual desktops. While not as
elegant, this represents a solution that is likely
suitable for most reporting needs and can be implemented
immediately (using the tools already present).


Pre-Requisites 
• Windows
• RStudio
• Rscript
• Git
• Firefox


The demo assumes you have RStudio installed and will
interact with the system via PowerShell. There is no reason
this will not work on Linux; however, it is not what we use
at the office, so it was not tested on it.


Checkout the base project 


To get started, clone the sample project and open it in
RStudio.


1. Navigate: 


2. Click: Fork 
3. Open your copy of the project 


4. Get the clone URL.


Go to the command line and checkout the project 


    cd ~/Project/Folder;
    git clone https://gitlab.com/jefferey-cave/shinyapp-desktop.git;
    cd shinyapp-desktop;
    ls -al


You should see a listing of all the files in the project.
Before we proceed, we should check to see if the project
runs on our computer. This ensures no basic configuration
issues before the actual work begins.


[Screenshot: the "Run App" button in the RStudio toolbar]


1. Double-click on the file desktopshiny.Rproj
2. Open: app.R
3. Click: Run App


You should see the shiny app open in the built-in browser,
though, depending on your environment, you may have to
resolve some dependencies.


[Screenshot: the "Old Faithful Geyser Data" demo app]


Proof that the application is running and any problems we may
experience are not with the computer configuration or app code.


Create a Starter Script 


While knowing the application works is nice, it could be a 
better user experience. We have been asked to create an 
app for those less technically inclined, and they should not 
need an instruction manual to get up and running. 


We can ease their experience by creating a starter script 
that loads their application from a shortcut. 


The first thing to note is the output during the run of our
Shiny dashboard. When we click on the button Run App we
see the exact R command being executed to achieve all of
this and its output:


    shiny::runApp()
    Listening on http://127.0.0.1:5436


Try copying/pasting your URL into your local browser; you
should see the same app.


Knowing that there is an R command that will start our
application, we can skip the IDE and run the application
using Rscript.


1. Stop your app in RStudio
2. Open a PowerShell terminal
3. Change to your project folder
4. Run your project using Rscript


    rscript.exe -e "shiny::runApp('.')"


Your app should be started, but without having to have the
customer load the IDE:


    Listening on http://127.0.0.1:3145


Try pointing your browser at that new URL. You should be
looking at the app.


NOTE 


The port changes every time you start. It is
randomly assigned at start-time. You can specify
the port that will be used; however, if we put
together more than one dashboard, having a
random port means less coordination between data
scientists (it should just work).


We will not make assumptions about the start 
conditions or set a static port. 
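

To make "randomly assigned at start-time" concrete, here is a small Python sketch of one common way a launcher can obtain an unused port: ask the operating system for one. This is an illustration of the general idea only. Shiny picks its own port internally, and the helper name `pick_free_port` is made up for this example.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))  # port 0 means "any free port"
        return probe.getsockname()[1]


port = pick_free_port()
print(f"http://127.0.0.1:{port}")
```

Because nobody has to agree on port numbers in advance, two dashboards started on the same desktop are very unlikely to collide, which is the coordination saving the note describes.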


If we check the runApp parameters, there is one extra 
parameter we can include to make this a little more 
user-friendly: 


    rscript.exe -e "shiny::runApp('.', launch.browser=TRUE)"


Save that in a text file called start.ps1.


You now have a basic script for users to start your 
interactive report. Having your customers click on the 
start script will give them a (mostly) seamless 
experience. 


Creating a new browser instance 
Since Shiny advertises the port it is listening on, we can
capture that information and then instantiate a special
browser instance on behalf of the user.
For our users, we could start a browser instance just for the
application. We also want to stop the Shiny instance when
the browser stops using it.


There is no data. There is only XUL!


(the XUL platform slogan) 


For this example, we will use Firefox, which is based on
the XUL platform and has a well-documented and
modifiable interface. To summarize (in a brutal way),
Firefox is a webpage that can be dynamically modified (if
you know how). We are going to use this feature to create a
primitive Site-Specific Browser.


We can extend our shiny server starting script using 
PowerShell to listen for the advertised port. We can then 
use this advertised URL and port to start a Firefox 
instance.


    # Start `rscript` and capture `stderr` for the port number
    & "rscript.exe" -e "shiny::runApp('.')" 2>&1 | % {
        # look for the `url` line
        if($_ -like "*Listening on*"){
            # Parse the input for the url
            ("$_" -replace ".*Listening on ","").Trim()
        }
    } | % {
        # open FireFox using the discovered URL
        & "C:\Program Files\Firefox\firefox.exe" $_
    }


This solution gets us part way there: we are starting a
unique browser instance for the shiny app.


The issue is that when we terminate our Firefox instance,
our Shiny instance continues to run in the background. We
must manually stop it.
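

The parsing step is the part most worth testing on its own, so here it is again as a short Python sketch, separate from the PowerShell. The function name `find_url` and the sample output lines are invented for illustration; only the "Listening on" line format comes from the runs shown earlier.

```python
import re

# Matches the line Shiny prints once it is ready, e.g.
# "Listening on http://127.0.0.1:5436"
LISTENING = re.compile(r"Listening on\s+(http://\S+)")


def find_url(lines):
    """Return the first advertised URL in an iterable of output lines, or None."""
    for line in lines:
        match = LISTENING.search(line)
        if match:
            return match.group(1)
    return None


# Made-up sample of what a start-up log might contain.
sample_output = [
    "Loading required package: shiny",
    "",
    "Listening on http://127.0.0.1:5436",
]
print(find_url(sample_output))  # prints http://127.0.0.1:5436
```

Returning `None` when no URL has been seen yet lets the caller keep reading output instead of launching a browser at nothing.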


We can continue to modify our script to start both the 
Shiny dashboard and the Firefox instance separately, then 
allow our script to maintain enough intelligence about 
them to monitor their independent process states. 


    # Start the `shiny` "thread"
    $shiny = Start-Job -Name "shiny" -ArgumentList($pwd) -ScriptBlock{
        param($workingdir);
        cd $workingdir;
        # Start the shiny app and print the URL
        & "rscript.exe" -e "shiny::runApp('.')" 2>&1 | % {
            if($_ -like "*Listening on*"){
                ("$_" -replace ".*Listening on ","").Trim()
            }
        }
    }


At this point, Shiny is started as a job, and $shiny
maintains a reference that can then be used to stop the job
later. This does add the problem that we need to read from
the output stream slightly differently.


    # poll the shiny thread for output
    while ($shiny.HasMoreData -or $shiny.State -eq "Running") {
        $url = $shiny.ChildJobs[0].Output.ReadAll();
        # when we find the URL... stop
        if($url){
            break;
        }
    }


Now, we have the URL at the script level and can proceed
to start Firefox.


Again, we want to create a separate process that we can
monitor.


    # create an array of arguments
    $args = @('-profile',"./profile",'-new-instance',"-url `"$url`"");
    # start the firefox instance
    $ff = Start-Process "C:\Program Files\Firefox\Firefox.exe" -ArgumentList $args -PassThru -Wait


This will block the script until Firefox stops.
Pay close attention to the arguments passed to Firefox.


• profile: This uses a pre-existing profile that is
customized to our purposes.


• new-instance: Ensures it does not re-use any
instances of Firefox that may already be open


• Wait: Ensures that the PowerShell job blocks
processing until it ($ff) completes


This forces a new-instance, a new profile, and a wait
until Firefox completes. The custom profile is used to
manipulate how Firefox appears to the user. For the
adventurous, inspect the included profile to see ways you
can manipulate Firefox to make it behave more like we
want.


Our final step is to stop the shiny process once the Firefox
process has terminated.


    Stop-Job $shiny.Id
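

For readers who think more easily outside PowerShell, the lifecycle we have just assembled (start the server, block on the browser, stop the server) can be sketched in Python. This is an illustration, not part of the sample project: the two commands are harmless stand-ins so the sketch runs anywhere, and `run_app` is a made-up name.

```python
import subprocess
import sys

# Stand-ins so the sketch runs anywhere: a "server" that would run for a
# minute, and a "browser" that exits on its own after a moment.
server_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
browser_cmd = [sys.executable, "-c", "import time; time.sleep(0.2)"]


def run_app(server_cmd, browser_cmd):
    """Start the server, block until the browser closes, then stop the server."""
    server = subprocess.Popen(server_cmd)
    try:
        # Equivalent in spirit to Start-Process ... -Wait
        subprocess.run(browser_cmd, check=False)
    finally:
        # Equivalent in spirit to Stop-Job: never leave the server behind
        server.terminate()
        server.wait(timeout=10)
    return server.returncode


exit_code = run_app(server_cmd, browser_cmd)
print("server stopped with exit code", exit_code)
```

The `try`/`finally` is the design choice that matters: even if the browser step fails, the background server is still cleaned up.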


The completed PowerShell script looks like this:


    $shiny = Start-Job -Name "shiny" -ArgumentList($pwd) -ScriptBlock{
        param($workingdir);
        cd $workingdir;
        & "rscript.exe" -e "shiny::runApp('.')" 2>&1 | % {
            if($_ -like "*Listening on*"){
                ("$_" -replace ".*Listening on ","").Trim()
            }
        }
    }
    while ($shiny.HasMoreData -or $shiny.State -eq "Running") {
        $url = $shiny.ChildJobs[0].Output.ReadAll();
        if($url){
            break;
        }
    }
    $args = @('-profile',"./profile",'-new-instance',"-url `"$url`"");
    $ff = Start-Process "C:\Program Files\Firefox\Firefox.exe" -ArgumentList $args -PassThru -Wait;
    Stop-Job $shiny.Id;


You should be able to run the script


    .\run.ps1


and (eventually; it's a little slow) see your app running in
a window.


[Screenshot: the "Old Faithful Geyser Data" app in its own window]


Our dashboard running as a standalone desktop application. The icon
can be changed by modifying the files in the profile folder.


Creating a Shortcut


One of the issues with creating a PowerShell script is that
regular users can't run it without a bit of know-how. The
easiest way to get around this is to create an old-fashioned
BAT file.


    cd <working directory>
    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe ".\run.ps1"


Sure, Microsoft asked us to stop using that in 1995, but
we are already in the depths of "get things done".

You will likely get an error if you run this as a regular user. 
The problem is that Windows (in their infinite wisdom)
makes scripting unavailable to users by default. This is to
protect them from malicious scripts.


To activate the script, you must indicate you know what
you are doing. Since we only want our users to see the
desktop window, we may as well hide the console window
while we are at it.


    cd <working directory>
    C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -windowstyle hidden -executionpolicy bypass ".\run.ps1"


• executionpolicy: allows the script to be run


• windowstyle: will enable us to hide the terminal
window


Since this is a desktop application, a shortcut file (LNK) is
better than the above BAT file. These allow us to specify all 
the same parameters and an icon file while removing all 
the console windows. 


[Screenshot: the shortcut Properties dialogue, with the Target set to
the powershell.exe command above]


The link settings dialogue, showing it filled in and with the icon set


1. Navigate to the working directory in Windows 
Explorer 
2. Right-Click » New » Shortcut
3. Set the properties
   • Target:
     C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -windowstyle hidden -executionpolicy bypass .\run.ps1
   • Start in:
     <working directory>
   • Change Icon

By setting those three options, your users are a double 
click away from a reasonably seamless desktop experience. 


A Happy(ish) Colleague 


The lead for the Data Science team I was working with was
(reasonably) happy with the solution. It was a hack, but it
got his team up and running in a matter of days.


We both agreed that the optimal solution was to get a
Shiny Server installed on-prem and link the URL from the
internal website, so I put him in touch with the correct
procurement experts and gave him this solution. I don't
know what ever came of the procurement.


We refined the team's solution (mostly automating the
deployment to the shared filesystem from GitLab's CI/CD
features), but for the most part, the above solution
represents a quick and dirty way for Data Scientists to get
their work in front of data-driven decision makers.


To this day, the internal website maintains a link starting
with file:/// that points to the shared network file
system.


For those paying attention, this solution is not constrained
to Shiny but applies to any served web application: Node,
Python, or perhaps something tucked away in a docker
instance. This can also constrain a user to a web-based
application to prevent students from cheating in a test or
keep a temporary labour pool focused on their tasks.


I make no claims that this is the right solution; what I
suggest is that it is feasible. Any organization that can
build a Shiny application also has the tools to implement
this solution. I share this solution, hoping it helps another
Data Developer when Bureaucracy gets in the way.


How fast is fast
enough?


A rambling discussion of the implications
of Real-Time data


My claim: real-time is anything faster than a change can
be observed


• Understanding Real-Time Data: "Real-time"
means different things in different contexts,
defined by the observer's ability to observe the
change.


• Everyday Impacts: Critically think about how fast
information is required to make decisions.


• Finding the Right Balance: Explore practical
examples of matching data processing speed with
human needs.


Years ago, I got into a lunchtime discussion of real-time
data processing, and a couple of guys at the table started in
with a macho attitude:


1. I used to work on fighter jets... real-time is
microseconds


2. I used to do nuclear weapons testing... real-time is
nanoseconds


3. I used to do solar flare warning systems


As the time scale increasingly reduced, I realized that my
perspective differed. I came from a healthcare setting that
involved patient charts. In my head, inter-hospital patient
transfers were the shortest timescale for transferring
information, which involved humans reading and
interpreting textual information. In this case, the
information bottleneck was the time it would take for the
patient to arrive at the new site and the staff at the new
site to read the chart (1~2 hours, sometimes up to a shift
change).


One of the other developers at the table shared my slower 
perspective. In a previous lifetime, she had developed 
automated terrorism threat assessments and resource 
deployment systems (at least, this is true in my head ... she 
was always a little vague about what she had done 
previously). She was doing push notifications, but her 
bottleneck was the time it took for humans to comprehend 
the information they had received and to strap on a rifle. 
She put real-time at 15~30 minutes.


Timestamps


How we think about time is often built on a lifetime of
assumptions.


In some database systems, there is a datatype known as a
timestamp. A timestamp is a sequential number applied to
the system. It is not a date or time but a point in time.


There is a very poignant scene in the TV show Angel in
which Lorne (the karaoke hosting daemon) is counselling
someone after a breakup: "I can hold a note for a long time.
But eventually, that's just noise. It's the change we're
listening for... that's what makes it music."


I have often extended this point: The change of state
defines time; time itself is a perception of a changing state.
This may be an oversimplification, but in terms of
managing data, it is a useful one. From our system's point of
view, there has been no change, and therefore, no time has
passed. Our system perceives time differently than we do,
so it is natural that it should use a different convention for
recording time.


In terms of timestamps, we should have point-in-time 1,
then point-in-time 2, followed by 


Time is only observable at its smallest division; time's
smallest division is the point of observed change.


Observation of a change defines time. A point made most obvious by
the Zoopraxiscope. (IMG: Wikimedia, CC-SA 2.5)


Real-time happens when change occurs between the
observable points, so it is available at the next observation
point (or possibly even creates the following observation).
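

A few lines of Python make the "no change, no time" convention concrete. This is a sketch under my own naming: `ChangeClock` and the patient-transfer states are invented for illustration and do not describe any particular database's timestamp type.

```python
class ChangeClock:
    """Logical clock: the timestamp advances only when the state changes."""

    def __init__(self, state=None):
        self.state = state
        self.timestamp = 0  # point-in-time 0: nothing observed yet

    def observe(self, new_state):
        # No change means no time has passed, from the system's view.
        if new_state != self.state:
            self.state = new_state
            self.timestamp += 1
        return self.timestamp


clock = ChangeClock(state="admitted")
readings = ["admitted", "admitted", "in transit", "in transit", "arrived"]
stamps = [clock.observe(r) for r in readings]
print(stamps)  # [0, 0, 1, 1, 2]
```

Five observations produce only three points in time, because only two changes happened. Anything that delivers the new state before the next observation is, by this definition, real-time.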


Human Speed 


In high school, a friend and I discovered Network Time
Protocol (NTP) and the Canadian Atomic Clocks. While
reading the user manual by the National Research Council,
I remember reading the request to not use the most accurate
servers.


RFC-1305, Network Time Protocol (Version 3)
https://datatracker.ietf.org/doc/html/rfc1305


It was just a friendly request to be polite.


The way the system works is that depending on how
accurate your needs are, you are supposed to use a
decreasingly reliable stratum of service. Stratum 1 sits
right on top of the atomic clock, Stratum 2 servers update
from Stratum 1 (introducing some potential error), and
Stratum 3 servers update from Stratum 2 servers
(introducing some more potential error). So, at the time,
you were politely asked to use Stratum 3 servers to avoid
overloading the Stratum 1 computers.


Seems fair 


Unfortunately, people are people, and my friend began
updating his analogue watch (readable to the minute) by
hand with Stratum 1 servers. When I told him Stratum
three (accurate in the range of milliseconds) was good
enough and that he was decreasing the accuracy for
everyone, he boldly told me, "Nope, only Stratum 1 is
accurate enough for me."


https://nrc.canada.ca/en/certifications-evaluations-standards/canadas-official-time


You are supposed to use the stratum of service appropriate to your
needs. Stratum 1 sits right on top of the atomic clock, with each stratum
reading from the previous level. (IMG: Wikimedia, Public Domain)


Macho statements aside, it is obvious that the speed
bottleneck in the system is human-scale, not
computer-scale, let alone atomic.


• A $10 watch will not retain accurate time for a long
time. The drift in the network is less than the drift
of the watch itself.


• An analogue interface, interacting with a human
eye, is going to have a reading accuracy that is
sub-minute, which is perfectly acceptable for the
human turning the knob, who cannot achieve
better than sub-minute accuracy anyway.


• The consumer is trying to promptly attend classes
and meetups at the coffee shop. The time it takes to
physically navigate the space between these events
introduces variance at the sub-quarter-hour level.


Society does not function at the microsecond scale.


At the time, the norm was to leave approximately 15
minutes to account for the vagaries of life. This is
consistent for two people suffering from 5 minutes of
error: 10 minutes of error plus 5 minutes of agreement. 15
minutes was real-time.


My friend's effort to achieve a smaller resolution was a
complete waste of time. At the same time, other people
were genuinely being harmed.


• Data Scientists relying on accurate time for
weather modelling are hurt.


• Sailors at sea, getting lousy weather predictions,
are hurt.


• My friend gets hurt when everyone is annoyed at
him for being late because he was fiddling with his
watch to make it accurate.


Faster than Observed 


I rarely see anyone reading data more frequently than
daily. Even with push notifications, I receive a text
message telling me to take action, but I'm happy if I can
take action within a quarter business day.


Note 


Faster is still better than slower, and real time is
faster than the change can be observed.


In a business setting, an executive, manager, or business
unit often calls for a monthly report on the 31st; weekly
reports are run on Friday.


This is a huge mistake; we can do better given automated
systems. Real-time is still something we can strive for,
and a key benefit is that the results don't change
significantly if we achieve faster than observable rates.


Since 2010, I have had a trading bot developed in Google
Apps Script (which gives me a lot of cloud computing
power) that sends me a text message any time I need to
make a trade. The fastest data I receive is 15-minute
delayed price data, but the most significant data I
receive is published quarterly (Financial Reports). This
data must be aggregated into averages and deviations
and patterns. The system is attempting to establish
normal, and normal (by definition) doesn't change by
much.


This is true for most human-scale systems most of the
time.


Under these conditions, we are dealing with aggregate
data. The changes are aggregated into averages over days,
weeks, or even quarters. The assessments are not going to
change significantly on an hourly basis. This means that
the evaluation from yesterday is about the same
assessment I will get today. I may see a change in the
general trend, but it will be subtle and non-actionable in
the short term.


There is a massively beneficial implication to this.


Google Apps Script is a Javascript instance that runs in
the context of Google applications. This gives a lot of cloud
computing power to you.
https://developers.google.com/apps-script/overview


1. If I produce a weekly report that interprets and
advises the business, I should run the report daily.
The average of 7 days of business operations is
likely similar to 6 days. The results and conclusions
of a report produced on Thursday will likely be the
same as those produced on Friday.


2. If the Friday run fails, I have my conclusions from
Thursday. If my Thursday run fails, I have early
notice that the Friday run is likely to fail.


By working at one unit finer of granularity, you have given
yourself lead time on potential issues and created a
fall-back plan in case of catastrophic failure.
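

The two points above can be sketched in a few lines of Python. Everything here is illustrative: the sales figures are made up, and `weekly_report` and `run_daily` are hypothetical names. The sketch simply runs a 7-day report after every day and keeps the last good answer.

```python
from statistics import mean


def weekly_report(daily_sales):
    """Summarize the most recent 7 days; raises on bad data, like a real report."""
    window = daily_sales[-7:]
    if any(value is None for value in window):
        raise ValueError("missing data in reporting window")
    return round(mean(window), 2)


def run_daily(history):
    """Run the weekly report after every day, keeping the last good result."""
    last_good = None
    failures = []
    for day in range(7, len(history) + 1):
        try:
            last_good = weekly_report(history[:day])
        except ValueError:
            failures.append(day)  # early warning; last_good is the fall-back
    return last_good, failures


# Made-up numbers: eight good days, then a bad feed on day 9 ("Friday").
history = [100, 102, 98, 101, 99, 103, 100, 102, None]
result, failed_days = run_daily(history)
print(result, failed_days)
```

With these numbers the day 7 and day 8 reports differ by well under one percent, and when the day 9 run fails there is still yesterday's figure to hand to the business.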


In the past, this has resulted in:


1. ~3000 employees (myself included) getting paid
for 13 days instead of not getting paid at all


2. A JIT system (a life-safety service) being able to
estimate demand early so that key staff could
attend a funeral


3. Countless times, I did not have to do overtime
because a combination of poor null handling and
weird data caused failures, but (thankfully) days in
advance.


These are the same principles from my High school Math
and Physics classes: use one decimal place more to
calculate than you report. In business reporting, if you are
tracking dollars, calculate in cents; if you are tracking
cents, do your calculations in fractions of a penny.
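

A short worked example shows why the extra decimal places matter. It is a sketch using Python's standard `decimal` module with made-up line items; the variable names are mine.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# Hypothetical line items: a unit price tracked to a fraction of a cent.
unit_price = Decimal("0.3333")   # four decimal places internally
quantities = [1, 1, 1]

# Wrong order: round each line to cents first, then add up.
rounded_early = sum(
    (unit_price * q).quantize(CENT, rounding=ROUND_HALF_UP) for q in quantities
)

# Better: carry the extra places through the sum, round once when reporting.
rounded_late = (unit_price * sum(quantities)).quantize(CENT, rounding=ROUND_HALF_UP)

print(rounded_early, rounded_late)  # 0.99 1.00
```

The one-cent gap between the two totals is exactly the kind of discrepancy that turns into an awkward conversation with Finance, and it disappears when rounding is left until the reporting step.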


Microsoft has a datatype known as "money". This object
is an integer value but maintains 4 decimal places. I have
used this to good effect to determine that my maths are
absolutely correct to the penny when questioned.


Push and Pull


Push notifications change the playing field. Rather than
updating the data on a schedule, we advertise changes to
interested parties. However, we are still constrained to the
response time of our slowest observer. Even in the case of
nuclear blast detection, the point was to log and collect the
data for interpretation by humans at a later date.


Take my trading bot: It sends me text notifications almost
immediately (magnitude of seconds). This is faster and far
more convenient/reliable than me checking once a day.
However, it does not mean I can respond any faster.
Real-time is constrained by the speed at which I receive
the information and the speed at which I can respond. If I
am trapped in a meeting, a secure network environment,
or up to my waist in a river while fishing, I may not be able
to initiate the trade for a couple of hours.


Lazy Loading Improves Net 
Performance 
So if, in most (human) cases, it is sufficient to deal with
timescales of minutes or hours, then we can conclude that
the reports do not need to be updated any faster.


In general, updates do not need to be generated more
frequently than they will be consumed by the observer
(either digital, human, or system). This idea is where push
notifications can both help and hurt us.


Polling (regular pulls) of source systems generates
needless processing effort. Requesting information comes
at a processing cost, and both systems need to expend
effort talking to one another. If the system does not
change, all of that effort results in no change.


https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/currency-data-type


Push notifications allow us to reduce this overhead by
having the source system transmit change notifications to
interested parties if something changes. This means that
processing is performed when something needs updating.


However, given our sub-hour threshold for real-time, we
may receive notice of change more frequently than we
need to report it, and we may end up recalculating a report
more frequently than it can be observed.
This was most evident in a simple web app I recently
worked on. I wanted the user to receive real-time
notification of the correctness of their entry into the form.


Every change to the form results in a change event
processed by the back end. I followed the error
notifications from the validation and tried to type in a
valid value to observe the update coming back to the form.


I was being driven crazy. I was only 20 characters away
from a positive result, but I kept getting stopped by each
keypress as it recalculated the validity. Each keypress
pushed a notification to the back end, triggering a
recalculation... but I could already see I was wrong.
Internally, I was begging the system to just let me finish
typing.


In the end, I put a timer on the validation: do not update it
more frequently than every 23 milliseconds. That little bit
of delay allowed me to finish typing, and the quality of the
feedback did not suffer (maybe even improved) by
bringing it to the human scale.
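

The timer idea can be sketched in plain JavaScript. This is a minimal sketch and not the code from the web app: the names createRateLimiter, minIntervalMs, and onKeyPress are hypothetical, and the clock is passed in as a parameter (an assumption made so the behaviour can be checked without waiting).

```javascript
// Hypothetical helper: returns a function that answers "may I run now?"
// and says yes at most once per minIntervalMs milliseconds.
function createRateLimiter(minIntervalMs, now = Date.now) {
  let lastRun = -Infinity; // time of the last permitted run
  return function shouldRun() {
    const t = now();
    if (t - lastRun < minIntervalMs) {
      return false; // too soon: skip this validation
    }
    lastRun = t;
    return true;
  };
}

// Usage: validate on a keypress only when the limiter allows it.
function onKeyPress(limiter, validate, value) {
  if (limiter()) validate(value);
}
```

A limiter this simple drops the keypresses that arrive too early, so a fuller version would probably also run the validation once more after typing stops.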


Buffering results until someone wants them takes us back
to lazy loading. If nobody will read your data, don't bother
calculating it. This reduces overhead because you may
have 20 updates but only one view (and, therefore, one
calculation).


If you are speaking about push notifications, you should
think about lazy loading. A push notification can be used to
notify our system that it needs to update, but we can use
lazy loading to defer that processing until it is needed.

But it is a balancing act. We can defer processing, but we
also need to balance it with performing it frequently
enough.
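

The combination of push and lazy loading can be sketched as a "stale" flag. This is an illustration under assumptions, not a listing from a real system: createLazyReport, notifyChange, and read are hypothetical names, and calculate stands in for whatever expensive report you have.

```javascript
// Hypothetical lazy report: pushes only mark the result as stale;
// the expensive calculation runs when somebody actually reads it.
function createLazyReport(calculate) {
  let stale = true;     // nothing has been calculated yet
  let cached;           // last calculated result
  let calculations = 0; // counts how often the real work was done
  return {
    // cheap: called on every push notification
    notifyChange() { stale = true; },
    // expensive only when the data changed since the last read
    read() {
      if (stale) {
        cached = calculate();
        calculations += 1;
        stale = false;
      }
      return cached;
    },
    get calculations() { return calculations; },
  };
}
```

With this shape, twenty calls to notifyChange followed by one read cost a single calculation, which is the "20 updates but only one view" case.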
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Conclusion 


Most of us are not attempting to detect the oncoming wave
of a nuclear blast, nor are we racing ahead of a solar flare
about to destroy the multi-billion dollar data on our
international network infrastructure. Most of us operate
on a timescale of minutes or hours, well within the
operating tolerances of even the most basic of desktop
computers.


Given this, we need to remember two key points:
1. Time scales dictate real-time


2. You should always be processing one unit of time
smaller than will be consumed


We can scale our response time to an appropriate level
through lazy loading and push notifications. An aircraft
trying to stay airborne requires a scale different from
inventory management in a retail organization.


Inundating humans with data does not improve
information uptake.


Having said that, we want to keep ahead of our audience.
We can deliver information faster and more frequently
than people need it, so there is no reason not to have it
standing by, ready for them when they want it. There is no
reason for us not to do our checks and balances well in
advance.


The key is balance.
Don't let sales tactics (and your own ego and machismo)
make you forget that every solution has its own set of
drawbacks and is subject to the law of diminishing returns.
At some point, we need to recognize that the problem is
solved and the solution is good enough.


Once real-time moves past observable, getting faster is
wasted time.
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Really Simple, Simple 
Messaging Service 


Sending Text Alerts for Dummy 
Programmers 


• Simple Messaging Solution: Learn how to set up
SMS notifications using email-to-SMS services
from North American cellular providers.


• Cost-Effective Approach: Discover a
budget-friendly method for automating alerts
without the need for expensive third-party
services.


• Practical Implementation: Follow step-by-step
instructions and code examples to create your own
SMS notification system using Google Apps Script
integrated with Google Sheets.


Sample Code Available
Before I began my journey in the field of Software and
Data, I started a career in Nursing.


I spent many days working in a hospital with the ward's
pager stuck to my belt. The point of this pager was that if
there was an emergency, all staff could be recalled to their
ward to assist with the emergency. Even though most
people owned a cell phone, pagers were passive radio
devices that were safe to carry around sensitive
equipment.


Fast-forward a couple of years to my first job as a Software
Developer. Working for a small consulting company, it was
only a short time before I was pressed into the on-call
rotation and handed the on-call cell phone. The point was
that when something terrible happened, the customer
representative would call you to fix problems that
customers experienced. Often, this meant logging in
remotely and restarting a crashed service.


It occurred to me (kind of obviously) that we could
automate the system checks and, just as I experienced in
the hospital, send a notification out to pagers, potentially
resolving issues even before the customer noticed a
problem. (I know, revolutionary thinking.)
There were two arguments against this idea:

1. We would not catch all the problems.


2. It was really expensive.


It took everything I had to keep calm about the first point,
but my glare communicated my opinion of it effectively.


The second point was very valid.
Phones capable of receiving email without a WiFi router
were a long way away (or prohibitively expensive). Pagers
were cheap enough, but implementing an automated
pager service via the APIs was wildly pricey. I don't
remember the exact cost, but I remember looking at the
amount, comparing it to my pay cheque, and deciding I
would rather my co-workers and I get paid.


It was one of those problems that always irked me enough 
that I thought about it but not enough for me to chase it 
down. 


About 10 years later, I was on contract with a major
university, working on their communications platform.
We were running some tests on the website's emergency
banner, and I joked that we should implement an SMS
service for students.


With a twinkle in his eye, the lead developer said they
already had and let me in on one of the industry's greatest
secrets: a free SMS service is already built into all North
American cellular provider services.


I am going to show you how to create your own SMS
notification system.


While I have used this technique several times over the
years, the only implementation I still have access to is in
Google Application Script (GAS). GAS is a JavaScript
implementation used to enhance documents in the Google
Office Suite. One of the advantages of using an online
spreadsheet was that it had scheduled tasks, and the
computer was always on. Effectively, it is a cheap
computing engine. All samples will assume GAS.


Google Apps Script
https://developers.google.com/apps-script/
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The Trick 


The really simple trick is that service providers in North
America all provide an email-to-SMS service. Every phone
number in North America has a corresponding email
address.


The service providers do not widely advertise it, but it is
not hidden either. All that needs to be done is to send a
carefully crafted email to the phone number's email
address. The email address domain will depend on the
provider:


Provider        Email domain
Bell Mobility   txt.bell.ca
Fido            fido.ca
MTS             text.mtsmobility.com
Sasktel         sms.sasktel.com
Telus           msg.telus.com
Virgin          vmobile.ca


Given this information, the phone number
(403) 123-4567, and knowing that the carrier is Shaw, we
would email 4031234567@txt.shawmobile.ca.
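

Building that address is a small string operation. The helper below is a hedged sketch: smsAddress is a hypothetical name, and it assumes a 10-digit North American number and a gateway domain that you have already looked up for the subscriber's provider.

```javascript
// Hypothetical helper: turn a phone number and a provider's gateway
// domain into the email address that is delivered as a text message.
function smsAddress(phone, gatewayDomain) {
  // keep digits only, so "(403) 123-4567" and "403-123-4567" both work
  const digits = String(phone).replace(/\D/g, "");
  if (digits.length !== 10) {
    throw new Error("expected a 10-digit North American number");
  }
  return `${digits}@${gatewayDomain}`;
}

// Example: a number and domain in the style used in this chapter.
console.log(smsAddress("(403) 123-4567", "txt.shawmobile.ca"));
```

Rejecting anything that is not ten digits is a design choice for the sketch; a real form would likely give the user a friendlier message.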


Different providers satisfy this in different ways, and
when I have changed providers, I have found it difficult to
receive messages without jumping through some hoops.
To confirm it is working on your phone, send yourself a
text message from your email.
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Sending Notifications 


The alert system is fairly straightforward


The system is pretty straightforward. It is designed to
identify the occurrence of an event and notify a list of
people. Traditionally, sending an alert email to
administrators has been a common and fundamental
requirement. Mail merges are an even more venerable
process.


Our system will go through three phases:
1. Identify that a notification must be sent
2. Identify who to send the notification to
3. Send the notification


Many details need to be handled to do this correctly, but
for the most part, these are the three phases of our system.
[Screenshot: Add Trigger dialog]


Your system would naturally decide when to notify: it
could be a system failure, a patient calling for help, or an
inbound solar flare. For demonstration purposes, we are
using a change in the status of a cell in a spreadsheet that
will be polled every 15 minutes.
Once we have identified that a notification should be sent,
we need to find the appropriate people to whom it should
be sent.


const book = SpreadsheetApp.getActiveSpreadsheet();


// Polled on a schedule: when the flag in cell B1 of the Config
// sheet is set, reset it and send the notification.
function CheckNotify() {
  let config = book.getSheetByName('Config');
  let notify = config.getRange(1, 2);
  if (notify.getValue()) {
    notify.setValue(false);
    Notify();
  }
}


function Notify() {
  let subs = book.getSheetByName('Subscribers');
  // columns 2 and 3 hold the phone number and provider domain
  let emails = subs
    .getRange(2, 2, subs.getLastRow(), 2).getValues()
    .map((d) => { return `${d[0]}@${d[1]}`; })
    // blank rows produce a bare '@', so drop them
    .filter((d) => { return d !== '@'; })
    .join(',');

  MailApp.sendEmail({
    to: 'example@example.com', bcc: emails,
    subject: 'PANIC!!',
    body: 'Please, take appropriate action.'
  });
}


To identify them, we simply look up a list of emails from
the subscriber table and loop through it.


The email will be sent from the spreadsheet owner. Notice
the use of BCC, which is a good practice to balance the
number of emails sent and protect the privacy interests of
the subscribers. Alternatively, send an email to each
subscriber individually.


It is often helpful to send to both text and email, where the
email version has a table of actions that must be taken
(using the htmlBody instead of body). This lets you receive
quick notice that you have to get to a computer and get
more details once you are there.


Untrusted Subscribers 


This tiny process is fine for a team looking for
notifications that a system has failed, that the build has
been completed, or that the automated tests failed. In fact,
adding your phone number and figuring out who your
provider is sounds like an excellent initiation task as part
of onboarding new members.


Actually, that's a really good idea. I may have to update
the onboarding documentation at work.


On a small team, everyone is trusted and responsible for
adding themselves. The list is small enough to manage by
hand. To some extent, you don't have to worry about
people maliciously adding their ex's phone number or
later claiming they did not request the subscription.


As systems get bigger, they get more complex. 


Let us change the purpose of the system. Let us assume we
are detecting a Zombie Apocalypse. The wider public may
be interested in hearing about it. In this scenario, we could
add a web form that allows people to subscribe to our
Zombie Apocalypse Alert system. Of course, not everybody
believes in zombies, so we need to confirm that
subscriptions are from people who want them.


A registration process is possible that verifies ownership of the phone.


The most straightforward mechanism for verification is
just to ask the person, and it has been the most common
means since email notifications were a thing. We can ask
them for their phone number (perhaps through a web
form), associate that phone number with a secret, and
then send the secret to that phone number. If they give us
the secret, they indicate that we won't be annoying them
by sending messages in the future.


This exact mechanism works for mailing addresses, email
addresses, or any other form of location authentication
(like OAuth).


function CheckForNewNumbers() {
  let submits = book.getSheetByName('Submitted');
  let verifies =
    submits.getRange(2, 1, submits.getLastRow(), 3);
  let values = verifies.getValues();
  for (let verify of values) {
    // skip blank rows
    if (verify[0] === '') continue;
    HandleNewNumber(...verify);
    // clear the row once it has been handled
    verify.fill('');
  }
  verifies.setValues(values);
}


function HandleNewNumber(name, phone, provider) {
  // generate a verification token
  let onetime = Math.round(Math.random() * 9999);
  // make it 4-digits
  onetime = ('0000' + onetime)
    .split('').reverse().slice(0, 4).join('');
  // set a one hour timeout
  let timeout = new Date();
  timeout.setHours(timeout.getHours() + 1);
  // send the token to the person
  MailApp.sendEmail({
    to: `${phone}@${provider}`,
    subject: 'Code ' + onetime,
    body: 'Your onetime code is ' + onetime
  });
  // save the values to the pending table
  let pendings = book.getSheetByName('Pending');
  let row = pendings.getRange(pendings.getLastRow() + 1, 1, 1, 5);
  row.setValues([[name, phone, provider, onetime, timeout]]);
}

The onetime token is a randomly generated number that is
large enough to be somewhat distinct and small enough
that it is convenient for a human to enter. A timeout of
one hour ensures it fails if no one actions it promptly;
an exploiter cannot use a forgotten code. Once we store the
token and share it with the target device, we have a means
of knowing we are communicating with the intended
audience.
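

The token step can also be looked at on its own, away from the spreadsheet and the email. The sketch below is an illustration, not the listing used in this chapter: makeToken and isExpired are hypothetical names, and the random source and clock are parameters (an assumption made so the result is repeatable).

```javascript
// Hypothetical pure version of the token step: no spreadsheet, no email.
function makeToken(random = Math.random, now = Date.now) {
  // whole number from 0 to 9999, padded with leading zeros to 4 digits
  const code = String(Math.floor(random() * 10000)).padStart(4, "0");
  // expiry as a number of milliseconds, one hour from now
  const expires = now() + 60 * 60 * 1000;
  return { code, expires };
}

// A token is unusable once the clock has passed its expiry time.
function isExpired(token, now = Date.now) {
  return now() > token.expires;
}
```

Keeping the code as a string matters here: a code such as 0042 loses its leading zeros if it is ever treated as a number.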
Once the subscriber comes to us with their identifying
number (their phone number) and the token we shared
with them, we have a way for them to demonstrate they
want to continue.


function VerifyToken(phone, token) {
  // grab a time stamp that we are going to use
  let now = new Date();
  // find the "pending" table
  let pendings = book.getSheetByName('Pending');
  pendings = pendings.getRange(2, 1, pendings.getLastRow(), 5);
  let rows = pendings.getValues();
  for (let row of rows) {
    // if the row has expired
    if (now > row[4]) {
      // delete the row's contents
      row.fill('');
      // and ignore it
      continue;
    }
    // the sheet may hand back numbers, so compare as strings
    let rowPhone = String(row[1]);
    let rowToken = String(row[3]).padStart(4, '0');
    // if phone number and token match the row
    if (rowPhone === String(phone) && rowToken === String(token)) {
      // put it in the subscriber's table
      let subs = book.getSheetByName('Subscribers');
      subs = subs.getRange(subs.getLastRow() + 1, 1, 1, 3);
      subs.setValues([row.slice(0, 3)]);
      row.fill('');
    }
  }
  pendings.setValues(rows);
}
It is important to remember this demonstration example
has been simplified for understanding. Several things can
be adjusted to improve security:


• the random function could use a strong
cryptographic version


• the size of the token could be increased, making it
harder to guess


• the timeout could be shortened, decreasing the
number of guesses that can be attempted


Further improvements are possible, such as removing
expired token rows (to prevent the database from growing
too much), controlling the number of instances of a single
phone number that are stored (to prevent a DDoS attack),
and placing a time limit between attempts to validate
(reducing the number of guesses possible). Also, I like
using Date.now() rather than new Date() because
dealing with an int is easier for me to wrap my
head around when it comes time to step through the code.
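

As one illustration of those improvements, the time limit between validation attempts could look something like the sketch below. It is written under assumptions: createAttemptGuard and mayAttempt are hypothetical names, and the timestamps are kept in an in-memory Map purely for demonstration (in a spreadsheet-backed script they would need to be stored in a sheet, because memory does not survive between runs). It uses Date.now() style integers, as preferred above.

```javascript
// Hypothetical guard: allow one validation attempt per phone number
// every minGapMs milliseconds. lastAttempt maps phone -> timestamp.
function createAttemptGuard(minGapMs, now = Date.now) {
  const lastAttempt = new Map();
  return function mayAttempt(phone) {
    const t = now();
    const previous = lastAttempt.get(phone);
    if (previous !== undefined && t - previous < minGapMs) {
      return false; // too soon after the previous guess
    }
    lastAttempt.set(phone, t);
    return true;
  };
}
```

A refused attempt does not reset the timer in this sketch; whether it should is a policy decision.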


Detecting the Provider 


If you are reading this article, it's likely because you did
not know about this service. It's, therefore, reasonable to
expect our general public subscribers to not know about it
either. Asking them to remember their exact provider,
distinguish it from similar names (vmobile or
virginmobile), and find it in a drop-down is
inconvenient.


It would be a vast improvement if we looked up their
provider automatically, and given a list of carriers, we can.
Our new process simply generates a unique one-time code
for each provider. By attempting delivery to every provider
and then waiting to see which one of the tokens got
through, we can determine which provider the subscriber
is using.
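

The bookkeeping behind that idea can be sketched without any email at all. This is a hedged illustration: codesForProviders and providerForCode are hypothetical names, and the random source is a parameter so the example is repeatable.

```javascript
// Hypothetical sketch: one distinct code per provider for a single phone.
function codesForProviders(providers, random = Math.random) {
  const byCode = new Map(); // code -> provider domain
  for (const provider of providers) {
    let code;
    do {
      code = String(Math.floor(random() * 10000)).padStart(4, "0");
    } while (byCode.has(code)); // regenerate on the rare duplicate
    byCode.set(code, provider);
  }
  return byCode;
}

// The code the subscriber types back tells us which gateway delivered.
function providerForCode(byCode, code) {
  return byCode.get(code) ?? null;
}
```

Each code would be sent to the same phone number at a different provider's domain; only the message that actually arrives can be read back, so the returned code identifies the provider.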


function CheckForNewNumbers() {
  let submits = book.getSheetByName('Submitted');
  submits = submits.getRange(2, 1, submits.getLastRow(), 3);
  let values = submits.getValues();
  // column 2 of the Providers sheet holds the email domains
  let providers = book.getSheetByName('Providers');
  providers = providers
    .getRange(2, 2, providers.getLastRow(), 1)
    .getValues();
  for (let verify of values) {
    // skip blank rows
    if (!verify[0]) continue;
    for (let provider of providers) {
      if (!provider[0]) continue;
      // try this number against every known provider,
      // reusing HandleNewNumber from the earlier listing
      verify[2] = provider[0];
      HandleNewNumber(...verify);
    }
    verify.fill('');
  }
  submits.setValues(values);
}


Conclusion 


I am unsure if this email-to-text capability is dictated by
law or is just a custom that evolved over time, but it is
certainly there for people to take advantage of. Rather than
sending signals to a costly service, it is possible to send
signals to cell phones cheaply.


One of the key advantages is that the infrastructure to
build this is already available: email is ubiquitous. While
the example was demonstrated in Google Application
Script, I have implemented this solution in Java, Python,
SAS, PowerShell, and Bash. It is only limited by how your
system sends emails.


Recently, I helped a team implement this to notify if one of
their desktops was turned off. They are using their
desktops as a linked network for distributed computing.
Everyone working from home needed a way to be notified
if a computer in the network got turned off. Each computer
in the network will now send a notice if one of its peers is
turned off.


If you are building a mission-critical system, I recommend
using a consistent and well-supported API. Various
providers offer various interpretations of the emails,
leading to some "interesting" messages being received. A
well-supported service mitigates this issue.


However, if you are a small team, just trying to get a
prototype in place, frustrated with bureaucracy getting in
the way, or just like to tinker, hopefully, this will get you
started.
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Standard Disaster 
Scenarios Your 
Business Needs to 
Prepare For 


Planning for things to go wrong, for 
Dummy Programmers 


• Critical Scenarios: Delve into four overlooked
disaster scenarios crucial for system
preparedness, offering insights for even the
most confident programmers.


• Realistic Challenges: Explore scenarios like
"Under the Bus" and "Bump on the Head,"
shedding light on potential disruptions beyond
typical considerations.


• Practical Solutions: Gain valuable strategies
and objection-handling techniques to fortify
systems against unforeseen events and ensure
resilience in adversity.
In any system design, there are several scenarios that
should be considered to prevent system failure. Each of
these scenarios describes a worst-case scenario that
frames planning for catastrophic events.


Often, when describing the need for various emergency
protocols, the presenter faces resistance along the lines of
"but we trust each other." These strawman arguments
distract from the genuine underlying risk that needs to be
addressed.


These descriptions and titles are meant to give a
standardized response to the most common objections.
Each scenario has a list of ways it presents in the real
world. The titles are humorous to ease the tension, but the
scenarios are serious and realistic.


The scenarios are also meant to be non-specific. Rather
than planning for specific events, general scenarios
encompassing general responses allow for adaptation to
multiple considerations.

1. Under the Bus

2. Bump on the Head

3. Spiked Drink

4. Sword of God

5. Daemonic Possession
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Under the Bus 


The primary on a system got run over by a bus on the way
to work and has been hospitalized for an indeterminate
amount of time.


Presentation 


Any unavailability of the system experts, potentially
combined with the need for action:


• Accident: sky-diving, home repair, car accident


• Vacation: phoning people while they are on
vacation is rude


• Illness: myocarditis, kidney stones, haemorrhoids,
common cold


• Arrest: sometimes people get detained, rightly or
wrongly


Objections 
That's a horrible thing to say

If it makes you feel better, they are going to be OK, but
accidents happen in life. Do you really want to be the
person who is phoning a colleague while they should be
resting in the hospital?

We had better make sure staff didn't do anything risky.
Go review that with your HR department. Informing
employees that they are not permitted to have hobbies
outside work is dangerous.




Bump on the Head 


One of the trusted individuals has recently received a
bump on the head and now has a brain injury that has
drastically altered their personality. They can no longer be
trusted. It is unclear how long they were trusted when they
should not have been.


Presentation 


• Blackmail: or possibly bribery, where an outside
actor has altered the state of the trust relationship


• Poor trust evaluation: you shouldn't have trusted
them in the first place


• External system breach: a trusted individual has
had their digital identity compromised


• An actual bump on the head has caused people's
personalities to dramatically change


Objections
It's OK, I trust you.


This statement exposes employees to risk, placing an
unfair responsibility on them.


The minute something goes wrong, employees should
have evidence that they were acting within acceptable
parameters and that the managerial staff had accepted any
risks associated with the action. If judgment calls were
required, and bad things happened, employees need to
have a clear line of approval in place that they can point to
as having failed (justifying their taking action).


Why psychopaths are so good at getting ahead
https://www.cnbc.com/2016/11/18/why-psychopaths-are-so-good-at-getting-ahead.html

Phineas Gage is the most famous “Bump on the Head” in
history
https://en.wikipedia.org/wiki/Phineas_Gage


Spiked Drink 


The trusted individual stands up from lunch and realizes
they are feeling wobbly. Someone spiked their drink.


Presentation 


Any scenario where the actor has a compromised capacity
for judgement:


• Woken in the middle of the night


• Family emergencies


• Had a couple of drinks, heavy pain medication


• Compromised judgement results in the inability to
judge yourself compromised


• Snap decisions


Plan for individuals to be able to declare themselves
incapacitated or compromised; plan for them to take
action even when their judgement is compromised; plan to
declare someone else's judgement as compromised. Have
clear instructions to reduce the need for judgment; the
time to make a plan is before the emergency.
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Objections 


People aren't allowed to drink on duty.
Being on-call, or worse, being the second or third person
on call, during an emergency can activate you at
unanticipated times. The only way to avoid this is to
consider all staff on-call 24/365, which is clearly
unreasonable.


Sword of God 


(aka Sodom and Gomorrah, Meteor Impact, Zombies)


A meteor has just hit the facility. Where there was a service
centre, there is now a crater.


If it makes you feel better, everyone in the region is OK but
more than a little distracted.


Presentation 


Any regional outage that results in an entire service being
lost. Limited to no staff in the region able to respond.


• Power outage


• Natural Disaster (storm, tsunami, earthquake)


• Epidemic


• War


• Civil Disruption (protest, riot)
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Objections 
Don't be over-dramatic 


During the 2005 Ice Storm in Montreal, a colleague's 
phone rang with a request for technical assistance from 
another company. The company, located in Montreal, had 
been without power for two days. Generators had been 
activated, and the facility was operational; however, due to 
the high demand for fuel replenishment and the state of 
infrastructure, they could not secure more diesel. Their 
three-day supply was about to run out. 


A heroic effort was undertaken; unfortunately, due to the
massive infrastructure disruption, we could not rebuild
their services on our infrastructure before the fuel ran out,
leaving hundreds of thousands of Canadians without
service for weeks.


Demonic Possession 


(aka Planetary Alignment, Plumb Bad Luck) 


You've done everything perfectly, but a small daemon is
inside your computer. As you type your solution, it waits
inside for an inopportune moment and messes something
up. Something important.


Presentation 

Software systems are complex, and complex systems are
just that... complex. Complexity leads to unpredictability,
and that is basically random behaviour. This can present in
all kinds of ways, none of them predictable.


• fat-fingering, typos


• stuff just stops working; nobody knows why
Objections 


If I can't predict it, how can I plan for it?
This is fatalism, giving up, and that we must not do.


Preparing for bizarro land is not easy, but it is possible.
Generally, this is done through constant testing and
rehearsal (you are rehearsing disasters, aren't you?). This
forces people to practice system failures and general
recovery under controlled circumstances.


Conclusion 


Published initially on my private consultancy website in
2013, this became something I wanted to preserve, share,
and keep living. I have shared it with every company I have
worked with, but it needs to be more widely distributed
because I have yet to see a company that can handle all of
these scenarios.
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De-duplicating Data 
Storage in Data Science 


How to Not Store the Same File Twice 


• Introduction to Data De-duplication: explores the
concept of de-duplicating data storage


• Addressing Duplicate File Storage: discusses the
prevalence of duplicate files in large file share
systems


• Optimizing Storage Costs: presents de-duplication
as a solution to reduce physical storage
requirements by showcasing potential cost savings
and efficiency gains


Many years ago, I read an article about Google's internal
labs creating a sha1 hash collision between two PDF
documents. The documents were very different, but
through clever bit manipulation, they resulted in identical
sha1 codes. It was a fascinating read for a Friday
afternoon, but over the weekend, I started to ask the
question: Why does Google care?


One significant place where this would impact Google
would be on their storage platform, Google Drive. Given
that they are storing massive numbers of files on behalf of
massive numbers of people, there will, in all probability,
be a massive number of duplicate files.


Logically, this is likely true through its usage. Assume I
engage in a real estate transaction; exchanging emailed
PDFs with scanned signatures is not uncommon.
Assuming everyone is using Google Drive to back up their
documents, four people have copies of the same
document: the buyer, the buyer's agent, the seller's agent,
and the seller. Remember to add lawyers and lenders later
in the process.


I have worked for organizations whose primary business
involved the interchange and storage of data (Oil/Gas,
Production, Telecommunications, or Data Repositories).
In every case, our solution to the problem was simple:
charge the customer. Charge the customer the variable
cost as a rate per byte. Multiply the


• amount of drive space they take up

• replication space

• server cost

• electricity

• rent

• markup


SHAttered gives a summary of the findings and links to
the relevant researchers.
https://shattered.io/


Charging the customer is an excellent way to offset the
cost, and Google does bill its customers. However, it is
possible to maintain the same revenue through
de-duplication while drastically reducing the amount of
physical storage.


Take the GSMA-RCC specification, which states that
images can be interchanged between client devices and
should be retained on the server for later pickup. So, if a
meme goes viral, thousands of individuals may forward
the image to one another, resulting in thousands of copies
of that image flying across the network and being stored
on server drives.


Further, if it's a good meme, people will forward it back to
people they know who have already received it. The image
goes round and round, and it only stops once it has
theoretically been sent on every possible communication
between pairs of people.


Rich Communication Suite - Advanced
Communications, Services and Client Specification,
Version 11.0
https://www.gsma.com/futurenetworks/wp-content/uploads/2019/10/RCC.07-v11.0.pdf
With 20 billion devices on the planet, more than Seven
Bridges need to be crossed.


If some jerk sends that meme as a bitmap, each image
instance takes up about 1079KB.


• $1.00 per MB revenue
• $0.90 per MB cost
• 1,000,000 user interactions (transfers)
• 1024KB file

profit  = datasize * transfers * (revenue - cost)
        = 1MB * 1 million * ($1.00 - $0.90)
        = $100,000 profit
storage = 1TB


We can be sensible about it and require that everything is 
converted to PNG (notice Google asks to do this for Google 
Photos), reducing the size to 615KB (60%). This is useful 
when dealing with a fixed revenue where people are paying 
a flat fee, but lossy compression is not a true copy of data 
and is not feasible for legal and science datasets. 


Digital Devices Are the Backbone of Every Organisation:
Are You Managing Them Properly?
https://technative.io/digital-devices-are-the-backbone-of-every-organisation-are-you-managing-them-properly
Alice   → meme.png @ Alice     → Binary File 1
Bob     → awesome.png @ Bob    → Binary File 2
Candice → lame.png @ Candice   → Binary File 3


Storing the object once per person is how the users perceive a file
system


Storing the image once per person receiving it is an
inefficient use of resources. If, on the other hand, we can
identify that it is the same image, we can reduce our cost
to storing only one instance but charging for each
transfer.


• $1.00 per MB revenue
• $0.90 per MB cost
• 1,000,000 user interactions (transfers)
• 1024KB file

profit  = (datasize * transfers * revenue) - (datasize * cost)
        = (1MB * 1 million * $1.00) - (1MB * $0.90)
        = $999,999.10 profit
storage = 1MB
There is a big difference between having a cost of 90¢ per
transaction and a cost of 90¢.
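

The two calculations can be checked with a few lines of JavaScript. This is a worked sketch using the example's own figures ($1.00 per MB revenue, $0.90 per MB cost, a 1MB file, one million transfers); the function names are hypothetical, and the amounts are kept in whole cents (an assumption made to avoid floating-point rounding).

```javascript
// Worked version of the two scenarios, in whole cents.
const REVENUE_CENTS_PER_MB = 100; // $1.00 per MB
const COST_CENTS_PER_MB = 90;     // $0.90 per MB

// Every transfer stores its own copy of the file.
function profitWithCopies(sizeMB, transfers) {
  return sizeMB * transfers * (REVENUE_CENTS_PER_MB - COST_CENTS_PER_MB);
}

// Every transfer is billed, but the file is stored only once.
function profitDeduplicated(sizeMB, transfers) {
  return sizeMB * transfers * REVENUE_CENTS_PER_MB
    - sizeMB * COST_CENTS_PER_MB;
}

// Format a number of cents as dollars with two decimals.
const dollars = (cents) => (cents / 100).toFixed(2);

console.log(dollars(profitWithCopies(1, 1000000)));   // 100000.00
console.log(dollars(profitDeduplicated(1, 1000000))); // 999999.10
```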


Alice   -> meme.png @ Alice     -> Binary File 1
Bob     -> awesome.png @ Bob    -> Binary File 1
Candice -> lame.png @ Candice   -> Binary File 1

Identifying that everyone is storing the same data allows us to
significantly reduce the amount of space we consume while giving the
same level of service to our users


However, de-duplication is not an easy problem to solve. 


First, we receive the files independently of one another
from different people with different names. Since we have
many large files, comparing each one byte-by-byte to every
other stored file will take significant processing power. In a
large-scale system, this simply is not feasible.


This brings us back to where we began: Google had
managed to cause a collision between SHA-1 hashes in the lab. Why
would they have been concerned with researching the
extreme possibilities of collision in binary documents?
Because they use hashes to help identify the uniqueness of
files.
[Figure: a user's file (meme.png @ Alice) mapped to a hash key]

We can use large hashes as primary keys for the files.
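
As a minimal sketch of the idea, the snippet below derives a key from content alone, so the same bytes always land on the same key regardless of who uploaded them or what they called the file. It assumes Node.js and its built-in `crypto` module; `contentKey` is a hypothetical helper name, not part of any library.

```javascript
const { createHash } = require("crypto"); // Node.js built-in module

// Hypothetical helper: the storage key depends only on the bytes.
function contentKey(data) {
  return createHash("sha256").update(data).digest("hex");
}

const fromAlice = contentKey(Buffer.from("same meme bytes"));
const fromBob = contentKey(Buffer.from("same meme bytes"));
const fromCandice = contentKey(Buffer.from("different bytes"));

console.log(fromAlice === fromBob);     // identical content, identical key
console.log(fromAlice === fromCandice); // different content, different key
```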


A Working Example 


Pulling on some experience solving this problem using
CouchDB, we can implement an in-browser
demonstration of the principles involved using JavaScript
and PouchDB. If you are not familiar with PouchDB and
using its DB interface, read the definitive introduction by
Nolan Lawson.


We will create a simple example with three users (Alice, 
Bob, and Carol) sharing their favourite lines from a new 
Opera* they just saw. 


Secondary indexes have landed in PouchDB, 2014-05-01
https://pouchdb.com/2014/05/01/secondary-indexes-have-landed-in-pouchdb.html
* The Pirate Movie, 1982. Not quite The Pirates of
Penzance, but close enough.
https://www.youtube.com/watch?v=ingtiyvUC-s&t=1850s
let db = new PouchDB('filestore');
async function main(){
  // start from an empty database on every run
  await db.destroy();
  db = new PouchDB('filestore');
  await CreateIndex();
  await save(
    'alice',
    'gilbert.txt',
    'I am the very model of a modern major general'
  );
  await save(
    'bob',
    'sullivan.txt',
    'I am the very model of a modern major general'
  );
  await save(
    'carol',
    'gilbert.txt',
    'I have knowledge of things animal, vegetable and mineral'
  );
  await save(
    'alice',
    'sullivan.txt',
    'I have knowledge of things animal, vegetable and mineral'
  );
  let data = await db.allDocs({include_docs:true});
  console.log(`DB Size: ${JSON.stringify(data).length}`);

  console.log(await dload('alice', 'gilbert.txt'));
  console.log(await dload('bob', 'sullivan.txt'));
  console.log(await dload('carol', 'gilbert.txt'));
  // per-user (billable) statistics
  let peruser = await db.query('allfiles', {
    reduce: true,
    group: true,
    group_level: 1
  });
  console.log(peruser);
}

main();


The process is relatively straightforward: three people
store one of two lines of text in their account and then
retrieve them from the database.


Of interest is the total size being stored, as well as the 
per-user size (billable size) usage: 


DB Size: 1813 bytes 


The real meat of the program is in the save and dload
functions, which abstract away the interactions with the
database. Further, CreateIndex defines the mechanism
for search and retrieval.


async function CreateIndex(){
  return await db.put({
    _id: '_design/allfiles',
    views: {
      'allfiles': {
        // index every record by [user, path], emitting its size
        map: function (doc) {
          var userpath = [doc.user, doc.path];
          emit(userpath, doc.size);
        }.toString(),
        reduce: '_stats'
      }
    }
  });
}


async function save(user, filename, blob) {
  return db.put({
    _id: [user, filename].join('@'),
    user: user,   // needed by the map function above
    path: filename,
    size: blob.length,
    _attachments: {
      0: {
        content_type: 'text/plain',
        data: window.btoa(blob)
      }
    }
  });
}


async function dload(user, filename) {
  let recs = await db.query('allfiles', {
    reduce: false,
    include_docs: true,
    attachments: true,
    key: [user, filename]
  });
  // the attachment comes back base64-encoded
  let blob = recs.rows[0].doc._attachments[0].data;
  blob = window.atob(blob);
  return blob;
}
The save and download functions work on the assumption
that we are going to store a copy of the record for each
person, while the index manages a list of users and their
files.


User  | Files | Size | Min | Avg | Max
alice | 2     | 101  | 45  | 50  | 56
bob   | 1     | 45   | 45  | 45  | 45
carol | 1     | 56   | 56  | 56  | 56

DB Size..: 1814 bytes
Billable.: 202 bytes


De-duplication - Before: an example working demonstration


Reducing Storage 


To modify this example to reduce our storage, we must
first change the save function to not blindly save for each
user but instead save each BLOB as a primary object and
track users observing it as a secondary item.


async function save(user, filename, blob) {
  // hash the content; the (shortened) hex digest becomes the record id
  let hash = new TextEncoder().encode(blob);
  hash = await crypto.subtle.digest('SHA-256', hash);
  hash = Array.from(new Uint8Array(hash));
  hash = hash
    .map(b => b.toString(16).padStart(2, '0'))
    .join('')
    .substr(0,4);
  let userpath = [user, filename].join('@');
  let rec = null;
  try {
    rec = await db.get(hash);
  }
  catch (e) {
    if (e.status !== 404) throw e;
    // create the object, with a list of users using it
    rec = {
      _id: hash,
      userpaths: [],
      size: blob.length,
      _attachments: {
        0: {
          content_type: 'text/plain',
          data: window.btoa(blob)
        }
      }
    };
  }
  if (!rec.userpaths.includes(userpath)) {
    rec.userpaths.push(userpath);
  }
  // finally save the record
  return db.put(rec);
}

This ensures that we only ever store a blob once.


Using the hash as the record identifier means that no
matter how many times it is submitted, we just keep using
the existing record. New users are simply added to the list
of users using the item. Users can even make copies of it by
submitting the same item with a different name; we keep
adding notes that the user has their own name for that
record.


Naturally, this breaks the lookup index we were using. The
original index used the record name (username+path) to
look up the file, but the file no longer uses this as its record
name. Instead, we need to create a lookup index
constructed from all the users who use the same file.


async function CreateIndex() {
  await db.put({
    _id: '_design/allfiles',
    views: {
      'allfiles': {
        // emit one [user, path] key per user referencing the blob
        map: function(doc) {
          for (let userpath of doc.userpaths) {
            userpath = userpath.split('@');
            emit(userpath, doc.size);
          }
        }.toString(),
        reduce: '_stats'
      }
    }
  });
}


By looping through each user's paths, we have updated our
index to return the underlying file for each file use. This
means no change to our download function, as the view's
interface has not changed.


Note that we included the file size in the view and used the
stats reduce method. This means that when it comes time
to bill, we can simply add up the number of blobs the user
references and their total size.
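
The billing step can be sketched without a database. The rows below are shaped like the result of a grouped query against a view that uses the `_stats` reduce (`sum`, `count`, `min`, `max`, `sumsqr` per key) and carry this chapter's sample figures; `billing` is a hypothetical helper name.

```javascript
// Rows as a grouped _stats query would return them, one per user.
const rows = [
  { key: ["alice"], value: { sum: 101, count: 2, min: 45, max: 56, sumsqr: 5161 } },
  { key: ["bob"],   value: { sum: 45,  count: 1, min: 45, max: 45, sumsqr: 2025 } },
  { key: ["carol"], value: { sum: 56,  count: 1, min: 56, max: 56, sumsqr: 3136 } },
];

// Hypothetical helper: turn the reduce output into billing figures.
function billing(statsRows) {
  const perUser = {};
  let total = 0;
  for (const row of statsRows) {
    const user = row.key[0];
    perUser[user] = { files: row.value.count, bytes: row.value.sum };
    total += row.value.sum;
  }
  return { perUser, total };
}

const invoice = billing(rows);
console.log(invoice.perUser);
console.log(invoice.total);
```

Billing is driven by what each user references, not by what is physically stored, so the total stays the same however the blobs are laid out underneath.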


Re-running main gives us a new total size and billing data:


DB Size: 1109 bytes


Even in our trivial example, with a very small content size, 
our database size (and therefore our business expenses) 
decreased by a whopping 39%. A better return is seen as
the ratio of storage size shifts toward content and less
with metadata (think MP3, MP4, and PNG).


Also, notice that the billable sizes stayed the same. 


User  | Files | Size | Min | Avg | Max
alice | 2     | 101  | 45  | 50  | 56
bob   | 1     | 45   | 45  | 45  | 45
carol | 1     | 56   | 56  | 56  | 56

DB Size..: 1109 bytes
Billable.: 202 bytes
Completing the Solution


After all of that, we cannot forget that Google
demonstrated that malicious collisions are possible;
therefore, hashes cannot be absolutely trusted. We must
perform complete byte-by-byte comparisons. Hash
functions are useful tools for determining dissimilarity
between binary objects, not for determining similarity.*


This is a handy feature. We can quickly narrow the
required comparisons to almost nothing using a large
hash. There is a good chance I will have no more than one
match that needs to be checked, which is significantly less
effort than millions. Once I've narrowed it down, the
complete file must be checked and, if it differs, stored
with an extra marker to distinguish it from the one we
already have.
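
One way this could look is sketched below: the hash picks the candidate, a byte-by-byte comparison confirms it, and a numeric suffix acts as the extra marker when the bytes differ. This is an in-memory illustration under my own assumptions (Node.js `Buffer`, a made-up `VerifiedStore` class, a `#n` suffix as the marker), not the PouchDB implementation from this chapter.

```javascript
const { createHash } = require("crypto"); // Node.js built-in module

// Hypothetical in-memory store: key -> Buffer.
class VerifiedStore {
  constructor(hashFn) {
    this.hashFn = hashFn;
    this.blobs = new Map();
  }

  put(data) {
    const hash = this.hashFn(data);
    for (let n = 0; ; n++) {
      // "#1", "#2", ... is the extra marker for colliding content
      const key = n === 0 ? hash : `${hash}#${n}`;
      const existing = this.blobs.get(key);
      if (existing === undefined) {
        this.blobs.set(key, data); // nothing stored here yet
        return key;
      }
      if (existing.equals(data)) {
        return key; // byte-by-byte match: reuse the stored blob
      }
      // same hash, different bytes: try the next marker
    }
  }
}

const sha256 = (buf) => createHash("sha256").update(buf).digest("hex");
// Deliberately weak hash so the demo can force a collision.
const weakHash = (buf) => String(buf.length);

const store = new VerifiedStore(weakHash);
const k1 = store.put(Buffer.from("abc"));
const k2 = store.put(Buffer.from("abc")); // duplicate content
const k3 = store.put(Buffer.from("xyz")); // collides with "abc" under weakHash
console.log(k1, k2, k3, store.blobs.size);

const realStore = new VerifiedStore(sha256);
const realKey = realStore.put(Buffer.from("abc"));
```

With a strong hash the loop almost always exits on its first pass, so the full comparison costs one file read rather than millions.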


A second problem is that we need to delete records at some 
point. In this case, we can simply drop the user's reference 
to the object, though we must also observe for the time 
when no user has a reference to the object. When all 
references to the object are removed, the object itself 
should be removed. Similarly, if a record is renamed, we
must remove the old reference before creating the new one.


I leave these as exercises for the reader.
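
As a starting point for those exercises, the reference bookkeeping might be sketched as below. It works on plain objects shaped like the de-duplicated record (`_id`, `userpaths`); the helper names are hypothetical, and wiring the result back to the database (saving the updated record, or removing it once it is orphaned) is left out.

```javascript
// Drop one user's reference; report whether the blob is now unreferenced.
function dropReference(rec, user, filename) {
  const userpath = [user, filename].join("@");
  const userpaths = rec.userpaths.filter((p) => p !== userpath);
  return { rec: { ...rec, userpaths }, orphaned: userpaths.length === 0 };
}

// Rename: remove the old reference, then add the new one.
function renameReference(rec, user, oldName, newName) {
  const dropped = dropReference(rec, user, oldName).rec;
  const newPath = [user, newName].join("@");
  const userpaths = dropped.userpaths.includes(newPath)
    ? dropped.userpaths
    : [...dropped.userpaths, newPath];
  return { ...dropped, userpaths };
}

const record = { _id: "ab12", userpaths: ["alice@gilbert.txt", "bob@sullivan.txt"] };
const afterAlice = dropReference(record, "alice", "gilbert.txt");
const afterBob = dropReference(afterAlice.rec, "bob", "sullivan.txt");
const renamed = renameReference(record, "bob", "sullivan.txt", "penzance.txt");
console.log(afterAlice.orphaned, afterBob.orphaned, renamed.userpaths);
```

Only when `orphaned` is true should the stored blob itself be deleted.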


Further Reading 


This demonstration uses JavaScript and PouchDB. I love
in-browser prototypes because they are so portable: you
always have a debugger and runtime environment
available.


* See the chapter “Using WebGL to Solve a Practical
Problem” for a solution to probabilistic matching of
similar large objects


PouchDB is a JavaScript implementation of CouchDB. One
of the large-scale CouchDB implementations would be a 
good start to expand the solution to something useful at 
the enterprise level. 


It is also worth mentioning that this technique is not
limited to any particular system. The same method could
be implemented at the Operating System level using
symbolic or hard links*. In fact, some database and FS
storage combinations may be an optimized approach.


• Manual: Apache CouchDB

• Manual: CouchBase

• Manual: IBM Cloudant

• Creating indexes: Secondary Indexes have landed in PouchDB

• How to Create and Use Symbolic Links (aka Symlinks) on Linux

• What is ZFS? A file system that has de-duplication built in


* https://www.howtogeek.com/287014/how-to-create-and-use-symbolic-links-aka-symlinks-on-linux/
Not storing (almost)
the same file twice 


De-duplicating Data Storage using Delta 
Files 


• De-duplicating Data Storage: Addressing the
  challenge of redundant data storage and the impact
  on storage costs.


• Data Revisions and Storage Overhead:
  Accumulating historic datasets over time, leading
  to significant storage costs.


• Cost-Effective Solutions: Introducing techniques
  to reduce storage costs through automated data
  management strategies.


• GitLab code sample that generates a sample
  historic change set of data.


• A notebook demonstrating the cost savings
  you can achieve.
In the previous chapter, I showed a technique for reducing
the storage load on a system that encounters multiple
copies of the same file. This is a fairly common situation in
data-intensive environments, as data scientists make
copies of the dataset they are working on, their colleagues
are working on, and they share it with their friends.
Applying the techniques, we showed a 36% profit increase
in a laboratory scenario.


Another widespread behaviour is Data Revisions. 


Naturally, our source datasets change over time, changing
our output datasets. While most people think of keeping all
data forever, for the most part, we are only concerned with
the current state. However, FOMO keeps us from removing
now-historic but redundant datasets. Over time, this data
continuously grows.


I'm going to share a way to reduce this cost
by 90% over 3 years.


90% savings in 3rd year


An example from my past was a heat map of financial
transactions across Canada. This dataset was based on the
last two years of economic data aggregated at the Postal
Code level. To place the items on the map, we had a second
dataset of financial districts and a third set of postal codes
and their corresponding financial district. To add a layer of
complexity, Canada Post changes its postal delivery routes
regularly and, therefore, its postal codes, and the
boundaries of our financial districts also change periodically.
This means that the proportion of a Postal Code that
resides within a Financial District is not consistent over
time and requires constant updating from a
third-party provider.*


The network drive folder looked something like this:


/proj/1/dashboard.html       12 KB
/proj/1/transactions.tns     58 B
/proj/1/postalcodes.csv      33 MB [ref]
/proj/1/districts.gml        37 MB [ref]

Total:                       70 MB


So, we have three datasets:

• Our live financial data (a live connection)

• A listing of postal codes

• A listing of location shapes


* World Map Subdivisions, All first-level subdivisions
(provinces, states, counties, etc.) for every country in the
world.
https://www.mapchart.net/world-subdivisions.html

* Nominatim, Open-source geocoding with
OpenStreetMap data
https://nominatim.org/
The relationship between the datasets and the final
aggregate that is displayed on the dashboard can be
described as:


select
    sum(p.weight * f.amt) as amt
from financial f    -- table name assumed; it is missing from the source
    inner join postal p on f.postal_code = p.code
    inner join districts d on d.id = p.district_id
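
To make the weighting concrete, here is a small sketch of the same join in JavaScript. All rows are invented illustration data, and grouping the totals per district is my assumption about how the dashboard uses the result; the query above does not show a grouping clause.

```javascript
// Made-up sample rows; the field names follow the query above.
const financial = [
  { postal_code: "A1A", amt: 100 },
  { postal_code: "B2B", amt: 40 },
];
const postal = [
  { code: "A1A", district_id: 1, weight: 0.75 }, // 75% of A1A lies in district 1
  { code: "A1A", district_id: 2, weight: 0.25 },
  { code: "B2B", district_id: 2, weight: 1.0 },
];
const districts = [{ id: 1 }, { id: 2 }];

// Weighted amount per district (inner joins: unmatched rows are dropped).
function amountByDistrict(financialRows, postalRows, districtRows) {
  const known = new Set(districtRows.map((d) => d.id));
  const totals = new Map();
  for (const f of financialRows) {
    for (const p of postalRows) {
      if (p.code !== f.postal_code || !known.has(p.district_id)) continue;
      const soFar = totals.get(p.district_id) || 0;
      totals.set(p.district_id, soFar + p.weight * f.amt);
    }
  }
  return totals;
}

const totals = amountByDistrict(financial, postal, districts);
console.log(totals.get(1), totals.get(2));
```

When a postal code boundary moves, only the `weight` values change, which is why the postal and district files have to be refreshed even though the financial feed is live.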


While the financial summary is updated in real-time, we
receive updated districts and postal codes every quarter. A
reasonably regular practice is that we receive an email
with a link to a CSV file every quarter, and someone has to
download the file and overwrite the current CSV files.
However, it is considered a best practice to make a copy of
the old files in an archive folder. This is done by creating a
copy and date-stamping it.


/proj/1
    dashboard.html               12 KB
    districts.gml                37 MB
    postalcodes.csv              33 MB
    arch/
        districts.202203.csv     37 MB
        districts.202204.csv     37 MB
        postalcodes.202203.csv   33 MB
        postalcodes.202204.csv   33 MB

Total: 210 MB


Many people recognize this pattern of file management as
very common. It is also easy to see how this can spiral out
of control.


• Notice that this is proj/1 of, let's say, about 100
  active projects


• These projects have been running longer than 2
  months (an average of 5 years)


• We have implemented a basic audit and recovery
  policy requiring redundant backups of the drive
  space, with point-in-time recovery capabilities
  (daily for a month, monthly for 10 years)


Given these approximations, we have quickly consumed
210MB (a tiny project) of data per project, with 60 months
of user copies, on slow archival disks with 150 disk
snapshots, and 100 projects. A total of 80 TB. The policies
and project volumes are all realistic; I suspect I'm
underestimating the storage demand. Assuming AWS S3
Standard storage* of the files (USD 0.021/GB), this costs
$70,000/year (CAD).


* Amazon S3 Pricing, 2024-04-12
https://aws.amazon.com/s3/pricing/
It's not my money ... why
should I care?


Two events made me care:


1. A colleague was denied access to the service
   because of concerns about disk consumption and
   cost, and their business case write-up did not
   persuade the executives. This was a significant loss
   to the organization.


2. A different colleague asked to recover a single file
   from a historic checkpoint. The backup team
   informed him they would need to find a disk big
   enough to restore the point in time. The backup
   team needed to restore the entirety of all the
   projects to get one file, a significant expense to the
   organization.


These are demonstrations of bureaucracy getting in the
way. They stem from a poor understanding of digital
storage and information management techniques by the
IT department and the Data Scientists.


Sometimes, we must keep stuff moving, even when 
bureaucracy gets in the way. 


A Quick History Lesson 


The disk usage pattern in question is well-established and
intuitive, usually developed by students in their first year
of college. Changes made to complex systems can result in
unanticipated outcomes. It is also not always obvious
which change (or combination of changes) led to the
behaviour; having old copies can help you understand
what has gone wrong.


Having identified only the portions of a file that have
changed, the fundamental problem becomes very
recognizable to most programmers (actually most content
publishers) as Change Management or Version Control.
Over the decades, several tools have evolved to manage
this problem, called version control systems (VCS). While
the field is littered with VCS, a few prominent ones
represent significant changes in how changes are thought
of and managed.


CVS | 1985 | Changes to individual files are tracked
             independently

SVN | 2000 | Related changes to multiple files are
             considered a single unit

Git | 2005 | Collections of changes can be managed
             as independent units


These three tools represent essential changes in
understanding how the databases in which we store data
revisions should be structured.


It is essential to see the differences between instances in a
large volume of data. Asking what has changed in large
volumes of data can take a lot of work. Difference tools
(diff-ing) became standard in the data and programming
toolkit (diff) in the mid-70s. Further, tools like patch
offer a way to transmit (or store) only the changes, which
may be significantly smaller than an entire copy. There are
now a plethora of graphical tools for such tasks.


Meld is a graphical tool for comparing the contents of text files to
determine what has changed. These tools can be used to great effect on
CSV files to determine if a significant change has happened.
(Wikipedia GNU License)


Techniques to Reduce the 
Problem 


These are well-established problems, and
well-established best practices are associated with them.


Compression 


While not the focus of this article, compression is an easy
and often overlooked solution to the problem. I am usually
glad it is ignored from a solution architecture perspective.
While putting the files in a zip archive is an easy solution,
it does reduce the visibility of the changes: files need to be
decompressed before they can be compared. Compression
should be maintained inside the solution and abstracted
from the user.


Having users compress their files reduces the ability of 
tools to take other actions that may have a more 
significant impact. 


Right-Scoped Repositories 


In our initial problem description, the IT team had
difficulty restoring the backups because they needed to
restore the entire repository to a point in time.


Rather than creating backups of the entire repository 
(measured in petabytes for our example), breaking the 
problem down into sub-parts may be more sensible. Many 
of our projects will get archived over time, projects change 
at different rates, and usage may decline or increase. This 
means that some data will have a greater or lesser
probability of requiring a restore at a given point in time.


Dividing the backups at a per-project level offers an
obvious division point.


Only Store Changes 


As discussed, tools like diff and patch offer a means to
identify, store, and apply changes to larger files. Rather 
than storing multiple copies of the dataset, it is possible to 
store a primary dataset and then track the series of 
changes that have taken place on it. 


For example, country lists change regularly, requiring an
update for even a spelling change. A CSV based on the
Wikipedia page for ISO country codes could be modified
with a patch description to accommodate Hungary's name
change in 2012.


@@ -117,1 +117,1 @@
-Hungary,Hungary,UN member state,HU,HUN,348,.hu
+Hungary,Republic of Hungary,UN member state,HU,HUN,348,.hu


This is significantly smaller than storing the entire file of
hundreds of countries for a single name change. As a tip,
the primary should be the one you are using, and the
changes can work backwards.
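
The "work backwards" tip can be sketched in a few lines: keep the current file whole and store only what is needed to rebuild the previous one. This is a deliberately simplified illustration that assumes both versions have the same number of lines; real diff and patch also handle inserted and removed lines. The function names and the CSV header row are made up for the example.

```javascript
// Record which lines differ, keeping the text needed to reach `toLines`.
function makeDelta(fromLines, toLines) {
  if (fromLines.length !== toLines.length) {
    throw new Error("this sketch only handles equal line counts");
  }
  const delta = [];
  fromLines.forEach((line, i) => {
    if (line !== toLines[i]) delta.push({ line: i, text: toLines[i] });
  });
  return delta;
}

// Rebuild the other version from a full copy plus a delta.
function applyDelta(lines, delta) {
  const out = lines.slice();
  for (const change of delta) out[change.line] = change.text;
  return out;
}

const header = "name,official_name,status,alpha2,alpha3,numeric,tld"; // hypothetical header
const current = [header, "Hungary,Hungary,UN member state,HU,HUN,348,.hu"];
const previous = [header, "Hungary,Republic of Hungary,UN member state,HU,HUN,348,.hu"];

// Store `current` in full, plus the reverse delta back to `previous`.
const reverseDelta = makeDelta(current, previous);
const restored = applyDelta(current, reverseDelta);
console.log(reverseDelta.length);
console.log(restored[1]);
```

Reading the current version then costs nothing extra, and only the rarely needed historic versions pay the price of applying deltas.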


A Solution


Rather than performing all of these steps manually, we can
turn to modern VCS applications, which were developed to
solve exactly these problems. As an odd quirk of history,
Subversion (SVN) is particularly well suited to handling
large files as it only tracks the differences between states.


While I recommend any VCS solution as an improvement
over file copies, SVN is ideally suited to Data Analysts'
management of large datasets.


Take our original problem of a project storage system with
archived folders.


* Statoids, Changes in ISO 3166-1
http://www.statoids.com/w3166his.html
/proj/1
    schema.json                  128 B
    dashboard-template.html      232 KB
    dashboard.html               100 MB
    districts.csv                37 MB
    postalcodes.csv              33 MB
    arch/
        dashboard.20220301.html  100 MB
        dashboard.20220315.html  100 MB
        dashboard.20220401.html  100 MB
        dashboard.20220415.html  100 MB
        districts.202203.csv     37 MB
        districts.202204.csv     37 MB
        postalcodes.202203.csv   33 MB
        postalcodes.202204.csv   33 MB


I have modified the example to include historic reports
built from a template.


Rather than trying to retrofit, let's just start over (with 
project #2). 


One of the first improvements we can make to the storage
structure is creating an independent archive location for
each project. This independent location can then be turned
into an SVN-controlled location.


mkdir -p /arch/2;
cd /arch/2;
svnadmin create .;


Now, we can link the archive location to the working area:


mkdir -p /proj/2;
cd /proj/2;
svn checkout file:///arch/2 .;


Once this is done, we can create our space and apply the
changes as we make them.


/proj/2/
    schema.json                  128 B
    dashboard-template.html      232 KB
    dashboard.html               100 MB
    districts.csv                37 MB
    postalcodes.csv              33 MB


cd /proj/2;
svn add *;
svn commit -m "Change 2022-03-01";

cp /proj/1/arch/dashboard.20220315.html dashboard.html;
svn commit -m "Change 2022-03-15";

cp /proj/1/arch/dashboard.20220401.html dashboard.html;
cp /proj/1/arch/districts.20220401.csv districts.csv;
cp /proj/1/arch/postalcodes.20220401.csv postalcodes.csv;
svn commit -m "Change 2022-04-01";

cp /proj/1/arch/dashboard.20220415.html dashboard.html;
svn commit -m "Change 2022-04-15";


Under these conditions, you will create snapshots of the
changes at each point something changed:


• The number of snapshots you have to maintain is
  reduced. No change, no snapshot.


• Backups are generated per working folder,
  meaning if a restore is required, it only takes the
  size of the individual project to go back in time.


• Your backup footprint is reduced because SVN only
  stores the differences between each snapshot.


/proj/2/
    .svn/                        13.4 GB
    schema.json                  128 B
    dashboard-template.html      232 KB
    dashboard.html               6.7 GB
    districts.csv                5.0 GB
    postalcodes.csv              1.7 GB


Note the creation of the folder '.svn'; this is true for most
VCS applications. They must create a control folder to
track their link to the repository. Also, this folder will
contain a single copy of the folder structure to allow it to
detect changes, doubling the storage space in the short
term. The total storage space in the working location
remains mainly untouched as the changes are made.


Snapshots can be viewed via SVN's log command.


$ svn log ^/ -qv

r4 | jeff | 2022-04-15 00:43:13
Changed paths:
   M /dashboard.html

r3 | jeff | 2022-04-01 12:28:08
Changed paths:
   M /dashboard.html
   M /districts.csv
   M /postalcodes.csv


And if a data restore needs to be performed for a user, it is
simple to select the historical revision.


* Version Control with Subversion, v1.7, svn
Subcommands
https://svnbook.red-bean.com/en/1.7/svn.ref.svn.c.log.html


mkdir -p /tmp/history;
pushd /tmp/history;
svn checkout -r 2 file:///arch/2 .;
popd;


One final benefit is that because the archives are no longer 
stored with the working copy, a different type of storage 
can be applied to the backups. Slower storage can be 
applied to the backups, while faster storage can be used for 
the working copy. 


The Social Aspect 


Unfortunately, in many cases of this problem, I have seen
the interested parties pointing at one another and saying it
is the other person's fault. Data Analysts need to be made
aware that version control tools exist. IT departments
need to view backups and restores as more than
whole-system events for recovering from complete
system failures.


The question becomes: who is this article designed for?
Data Scientists or System Administrators?


This is really for both. Hopefully, both parties will work
together to reduce costs and burdens on the other, but a
change will likely have to start with the IT department.


Further, reducing costs is seen as a negative in most 
organizations. Cost is associated with prestige; managerial 
resumes often boast about the size of their budget. 
Reducing the budget reduces prestige. 
A Scripted Solution


Given the social problems, the simplest means of
introducing users is to implement a simplified Version
Control practice in an automated fashion without asking
users, offering training, or talking about it.


Two steps should be taken:


• Immediately install a VCS client on every computer
  that accesses the system in question


• Integrate VCS creation into the project allocation
  part of the process


On the project setup and allocation side, the process
usually starts with a request for space on the computer as a
paper form (yes, companies are still filling out paper
forms, usually PDFs). As this kicks off a large process
involving multiple configurations being manually
configured, ensure the allocation of a VCS is part of that
process. The simplest way to ensure this is done is to
automate the entire process.


#!/bin/bash
proj="/proj";
arch="/arch";

# Create (if needed) the archive repository for a project folder and
# link the existing working folder to it.
function SetupFolder {
    F=$1;
    pPath="${proj}/${F}";
    aPath="${arch}/${F}";
    tPath=$(mktemp -d);
    mkdir -p "$aPath";
    svnadmin create "$aPath" && echo "Repository created." \
        || echo "Repository already exists ($?)";
    svn checkout "file://${aPath}" "$tPath";
    # move only the control folder into the project folder
    mv "$tPath/.svn" "$pPath/.svn";
    rm -rf "$tPath";
}

{
    proj=$(realpath ${proj});
    mkdir -p "$proj/00000000";
    echo "Project Folder: $proj";
    arch=$(realpath ${arch});
    mkdir -p "$arch";
    echo "Archive Folder: $arch";
    for dir in ${proj}/*/ ; do
        dir=$(basename "${dir}");
        echo "Updating: $dir" 1>&2;
        pPath="${proj}/${dir}";
        [ ! -d "$pPath/.svn" ] && {
            echo " - Linking project" 1>&2;
            SetupFolder "$dir";
        }
        pushd "$pPath";
        (( $(svn status | wc -l) > 0 )) && {
            echo " - synchronizing changes" 1>&2;
            # "?" marks new files, "!" marks missing ones
            svn status | grep -e "^\?" | cut -c 9- | xargs -r svn add;
            svn status | grep -e "^\!" | cut -c 9- | xargs -r svn del;
            svn commit . -m "$(date -Iminutes)";
            svn cleanup;
        }
        popd;
    done;
} 1>/dev/null;


This script scans all folders and determines if a change has
been made. If there are changes, it submits them to the
archive location. If no archive location exists, it creates
one. This can be run on a timer, or better yet, can use
inotify* to monitor for changes. Better still would be to
install a self-serve interface like GitLab that can
auto-allocate space and control permissions*, but ...
bureaucracy.


Teaching Users 


Install a highly visible VCS client on
every computer in the office. By highly visible, I mean
Tortoise on Windows* or RabbitVCS* on Linux. In both
cases, the users are automatically presented with icons
that tell them there is something unique about the folder
to which they have been granted access.


The Tortoise family of products places overlays on icons at the operating
system level. Users receive visual feedback that something special has
happened that they should learn more about. (TortoiseSVN Manual, GPL)


* inotify man page. inotify can be used to monitor for
filesystem events
https://www.man7.org/linux/man-pages/man7/inotify.7.html

* DataLab is a GitLab configuration designed with Data
Science workspaces in mind
https://gitlab.com/apub/DataLab

* https://tortoisesvn.net/

* https://tortoisegit.org/

* http://rabbitvcs.org/
One of three things will happen:


1. A user already familiar with Change Management
   Systems will be pleasantly surprised.


2. A user unfamiliar with Change Management
   Systems will curiously explore this new domain.


3. The user will not care... you can lead a horse to
   water, but you can't make it drink.


Once users see the log of the changes, they become aware
that a history is kept for them, and they become
empowered to restore their own historical data.


A Demonstration


To demonstrate how this all fits together, as well as to
compare the compression capabilities of various setups,
four versions of the script were created:


• CP

  Makes a copy of the files in the project folders and
  sends them over to the archive location. This offers a
  baseline comparison of what our users are
  currently doing.


• FZ

  The filesystem is compressed. We hope our users
  use ZIP, which creates a compressed copy of the
  folder each time the backup is called.


• Git

  Git is an excellent VCS and should be included in
  any comparison.


• SVN

  The script that was offered earlier.


To simulate our users' activities, a script called
demonstrate.sh downloads the history of the CIA World
Factbook as JSON from GitHub, stores each change in the
project folders and then backs them up. This is similar to
our users receiving updates to their data files and then
saving the old version to a backup folder. Results are
stored in results.csv.
After 34 changes, we can see that Git and Zipped file
systems perform almost equally, mainly because that is
what Git does (zips of entire file structures); compression
offers a lot of savings. By comparison, SVN shows almost
no growth at all.
This brings us back to our original problem state:
$70,000/year for storage, but using automated version 
control and SVN, in particular, we can reduce the storage 
costs by 90% to $7,000/year under real-world conditions. 


After 34 points of change, SVN shows a significant cost savings (90%) over
our base state


That is a 90% cost savings, or about $63,000/year. In
2024, that is the wage of a Data Analyst for a year.*


* PayScale.com indicates a Data Analyst in Canada earns an
average of $61,918/year, 2024-04-12

Most of these estimates are on the small side. Recent
experience has shown the involved datasets to change on
the order of 40 GB/quarter; suddenly, those numbers are in
the millions of dollars.


The cost savings relative to the baseline storage become more significant
as time progresses.
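
A rough model shows why the gap widens with time. This is only a sketch under two assumptions of mine, not the measured results: the dataset stays the same size, and each revision changes a fixed fraction of it (5% is an arbitrary value chosen for illustration). The 70 MB figure is the combined size of the two reference files from the example project.

```javascript
// Storage after n revisions when every revision keeps a full copy.
function fullCopies(sizeMB, revisions) {
  return sizeMB * (1 + revisions);
}

// Storage when one full copy is kept and each revision adds only its changes.
function deltasOnly(sizeMB, revisions, changedFraction) {
  return sizeMB * (1 + revisions * changedFraction);
}

// Fraction of storage saved by keeping deltas instead of copies.
function savings(sizeMB, revisions, changedFraction) {
  return 1 - deltasOnly(sizeMB, revisions, changedFraction) / fullCopies(sizeMB, revisions);
}

const sizeMB = 70;            // districts + postal codes from the example
const changedFraction = 0.05; // assumed: 5% of the data changes per revision

const afterOneYear = savings(sizeMB, 4, changedFraction);    // 4 quarterly updates
const afterFiveYears = savings(sizeMB, 20, changedFraction); // 20 quarterly updates
console.log(afterOneYear.toFixed(2), afterFiveYears.toFixed(2));
```

Under these assumptions the full-copy approach grows by the whole dataset each quarter while the delta approach grows only by what changed, so the saved fraction keeps climbing toward `1 - changedFraction`.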


Conclusion 


This technique has been applied several times over the
decades and has been demonstrated to work with not just
text files but also Parquet and SAS data files in Tableau and
RStudio applications.


Optimization of these techniques requires cognisant
cooperation from both Data and System Analysts.
Unfortunately, detailed information and change
management techniques are not exciting and, therefore,


https://payscale.com/research/CA/Job=Data_Analyst/Salary
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do not capture the attention of executives. However, a 
90% decrease in costs (and consequently an increase in
profits) should be addressed, and attention to these details
is essential.

For this reason, I have presented these techniques using
an automated but unobtrusive technique that fosters an
environment that can encourage learning and cooperation.


Mostly, these techniques become necessary to reduce cost
as an excuse for progress.


because sometimes bureaucracy gets in the way. 


As a final note, remember that cost correlates to carbon.
Reducing your consumption is easily measured in dollars
but also represents less pollution.


Save a byte, save the environment 
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How to Build a Simple 
In-Browser Search 
Engine 


Because sometimes bureaucracy gets in the way.


• Addressing a Common Need: Create a simple
in-browser search engine to manage a repository
of organizational digital assets.


• Overcoming Bureaucratic Hurdles: Explore using
simple technology to overcome bureaucratic
hurdles.


• Empowering User Accessibility: Simple tools
significantly impact user accessibility, productivity
and efficiency.


A simple working example that indexes the
CIA World Factbook and then offers a search
of the content. Search results offer a link to
the actual page on the CIA's website.


Several years ago, I was asked to create a file repository for
an air-gapped conda repository. The idea was that, for
security reasons, we would have a list of allowed libraries
inside the costly secure environment and only those
libraries. It was a reasonably simple setup, requiring only
a large amount of storage and a simple static web server to
deliver the files.


The simple filtered mirror allowed us to respond to user
needs rapidly, adding or removing libraries
from the secure environment. This meant the security
team was much more willing to allow packages into the
environment, knowing removing them would take hours
rather than months or years.


• They were so thrilled we started using general
search patterns to allow packages.


• We quickly went from dozens of packages to
thousands.


• Users developed a new question to phone in and
ask:


Which libraries are available? 


The list of libraries available was less than the official
repositories but significantly larger than a human could 


Anaconda is a library repository and management system
https://anaconda.org/

Conda Mirror is an Anaconda package for mirroring
conda repositories
https://anaconda.org/conda-forge/conda-mirror
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reasonably be expected to read through. It was also spread 
across several folders. All in all, it was just challenging to
keep track of. 


Users required a search engine like Anaconda.org, limited 
to our organisationally available packages. 


An Aside 


At this moment in our story, I find myself away from
home on business. Having attended an evening
lecture in a pub in Gatineau, Quebec, on using
technical systems to affect social change, and
having had a few drinks, I found myself walking back
to my hotel alone in the dead of a Canadian winter.
Trying not to freeze to death and figure out where
my hotel was, I started thinking about search
problems. By the time I got back to my hotel, there
was no way I could sleep, so I began to implement
the solution.


Ahhh... the life of a programmer. 


Unfortunately, bureaucracy got in the way. Getting the
static HTTP server had been a feat of negotiation. To get
the filesystem shared via HTTP, I had explicitly stated that
this would only have static rendering turned on. This
allowed us to skip undergoing months of security
evaluations; no server-side processing meant no security
risk to the network.


In order to accommodate both the need for a 
search engine and the lack of server-side 
processing, I built a simple search engine inside 
the browser. 
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Implementation 


I have actually done this a couple of times before. In a few
instances in my career, getting tools and computers has
been a significant barrier to a simple data processing
engine. The key to this is that you can drop a simple HTML
file at your server location, and it can then look up the data
stored on the server.


Auto-index and Loading 


One requirement from the server is to ensure that some
form of auto-index is on. The server must advertise the
available datasets to be processed so the client can
retrieve them. There are a couple of ways
to achieve this; the simplest is to activate that feature on
the server.


In similar systems, I have also used bash or PowerShell
scripts to list all the datasets present: no information
about them, just that they are present.


ls --format=single-column ./data/* > index.txt


The location of that file can then be passed to the script as
its starting point for gathering all information. 


All static webservers have the ability to generate an
index of content in a folder. In many cases, servers may not
have the feature on by default. In nginx, the module is
called ngx_http_autoindex_module
https://nginx.org/en/docs/http/ngx_http_autoindex_module.html
// Resolve index.txt relative to the folder this page was served from
const basePath = new URL('.', window.location.href).href;
const datasets = `${basePath}index.txt`;


Once we have the list of data to be aggregated, we can
begin the process of downloading, parsing, and storing. 


Given we used PouchDB, we can immediately begin 
loading the text into the local database. 


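const db = new PouchDB('searcher');

async function LoadDB() {
  // index.txt lists one dataset location per line
  let recs = await fetch('./index.txt');
  recs = await recs.text();
  recs = recs.split('\n').filter(loc => loc.trim().length > 0);
  for (let loc of recs) {
    let content = await fetch(loc);
    let nRec = {
      _id: loc,
      text: await content.text()
    };
    // db.get rejects when the document has not been stored yet
    let oRec = await db.get(loc).catch(() => null);
    // RecordsDiffer is a comparison helper that is not shown here
    if (!oRec || RecordsDiffer(nRec, oRec)) {
      // carry the revision forward so PouchDB accepts the update
      if (oRec) { nRec._rev = oRec._rev; }
      await db.put(nRec);
    }
  }
}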


The full text is now stored in the database for future use by
the user. It is recommended that this be scanned
periodically to see if any changes have occurred. (loader.js)
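
The loader relies on a comparison step to decide whether a stored record needs replacing, but that helper is not shown. The following is a minimal sketch of what such a check might look like, assuming each record keeps its downloaded content in a `text` field. The name `recordsDiffer`, the sample records, and the handling of a missing stored record are assumptions made for illustration, not the original implementation.

```javascript
// Hypothetical helper: decide whether a freshly downloaded record
// should replace the copy already stored in the local database.
function recordsDiffer(newRec, oldRec) {
  // No stored copy yet (first load), so the new record must be saved.
  if (!oldRec) {
    return true;
  }
  // Compare only the downloaded text; bookkeeping fields such as
  // PouchDB's _rev exist only on the stored copy and are ignored.
  return newRec.text !== oldRec.text;
}

// Usage with illustrative records
const stored = { _id: './data/fact-21.txt', _rev: '1-abc', text: 'old text' };
const fresh = { _id: './data/fact-21.txt', text: 'new text' };
console.log(recordsDiffer(fresh, stored)); // true
```

Comparing only the text keeps the periodic rescan cheap: an unchanged file produces no database write at all.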
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Sanitizing the Text 


In database development and searching, an index can be 
considered a quick reference. In the days of manual 
searching, a card catalogue offered an alphabetic search 
ordered by subject, which allowed for a fast lookup. Rather
than looking through every book in an entire building, you 
can look through a smaller listing in a single box. 


Less is more, smaller is faster. 


For this discussion, having a couple of documents we want
to search will be helpful.


Fact 21 


For every 25 percent increase in problem complexity,
there is a 100 percent increase in complexity of the
software solution. That's not a condition to try to
change (even though reducing complexity is always
a desirable thing to do); that's just the way it is.


Fact 22


Eighty percent of software work is intellectual. A fair
amount of it is creative. Little of it is clerical.


— Robert L. Glass, Facts and Fallacies of Software 
Engineering 


Our sample for discussion will revolve around indexing the
text of Glass's Facts to allow for rapid lookup. It might be
desirable to attach this to a microphone in the office,
displaying an appropriate fact depending on what is being
discussed (see chapter "The Angry Chatterbot").
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Each fact can be placed in a separate record for our 
indexing to pick up later. This will allow us to treat each 
fact as a distinct entity. 


./data/fact-01.txt
./data/fact-02.txt
./data/fact-03.txt


One issue with text like this is that a lot of it is
meaningless or at least has low meaning. Some level of
transformation should be performed to remove needless
text.

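function sanitize(text) {
  text = text.toLowerCase();
  // replace anything that is not a letter with a separator
  text = text.replace(/[^a-z]/g, ".");
  text = text.split(".");
  // drop empty strings and words of three characters or fewer
  text = text.filter(d => { return d.length > 3; });
  return text;
}


let sanitized = sanitize(facts[21]);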


As part of indexing fact number 21, we remove
• Case
• Non-text
• Short words (three characters or fewer, aka Stop Words)


Resulting in a sanitized list of words. (crepojs)


[
'every', 'percent', 'increase', 'problem',
'complexity', 'there', 'percent', 'increase',
'complexity', 'software', 'solution', 'that',
'condition', 'change', 'even', 'though',
'reducing', 'complexity', 'always',
'desirable', 'thing', 'that', 'just'
]


The list can be further reduced by noting the duplicate
words. A word count is a helpful way to weigh a term's
value in a search.


function WordCount(list) {
  let count = {};
  for (let word of list) {
    // start each new word at zero, then add this occurrence
    count[word] = count[word] || 0;
    count[word]++;
  }
  return count;
}


let count = WordCount(sanitized);


Creating the Index 
For this search, I chose to do a match on each word. This
was done for simplicity and because it allows for searching
for words that are out of order.
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Examples of the kinds of searches I want a user to be able 


to use:

Search            | Description
complexity        | A word that is overly complex (it can be reduced to "complex")
comp              | A word that is a partial match
Software olut     | Multiple words
Solution software | Words out of order


These examples represent a couple of cases we need to
handle. Most interesting is the partial match scenario.


To create an index, we need a list keyed on our lookup
value:

Key      | Score | Document
database | 100   | <url>
atabase  | 99    | <url>
tabase   | 98    | <url>
abase    | 97    | <url>
base     | 96    | <url>
ase      | 95    | <url>
databas  | 99    | <url>
databa   | 98    | <url>

When you search for "data search", it finds:


document 1
data 97
search 100


for a total of 197 points, and then sorts by highest
scoring document (sarc)
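
The lookup-and-sum step can be sketched as follows. This is an illustration under stated assumptions, not the original search code: the index is a small hand-written `Map` from key to scored documents, and the document names and scores are made up for the example.

```javascript
// A tiny hand-written index: key -> list of { document, score }.
// The entries are illustrative only.
const index = new Map([
  ['data',   [{ document: 'doc1', score: 96 }, { document: 'doc2', score: 96 }]],
  ['search', [{ document: 'doc1', score: 100 }]],
  ['base',   [{ document: 'doc2', score: 96 }]],
]);

function search(query, index) {
  const totals = new Map();
  // Split the query into words; order does not matter because each
  // word is looked up on its own.
  for (const term of query.toLowerCase().split(/\s+/).filter(Boolean)) {
    for (const hit of index.get(term) || []) {
      totals.set(hit.document, (totals.get(hit.document) || 0) + hit.score);
    }
  }
  // Highest total first.
  return [...totals.entries()]
    .map(([document, score]) => ({ document, score }))
    .sort((a, b) => b.score - a.score);
}

const results = search('data search', index);
console.log(results);
```

Because each term is scored independently, "search data" returns the same ranking as "data search", which is the out-of-order behaviour described earlier.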

Extras 


Lemmatization 

As another layer, we can consider lemmatization. Lemmatization
refers to the most foundational word represented by a
word. For example, "intellectual" can be thought of as
being the same as "intellect". This will help reduce the
number of possible typographic differences that occur
later. For example, someone who remembers "thats"
should find "that's". Come to think of it, "is" is a short
word.


every 25 percent increase problem complex there 
100 percent increase complex software solution
that condition change even though reduce complex 
always desire thing that just. 


MichMech offers simple lemmatization lists for
use in analysis.
https://github.com/michmech/lemmatization-lists/blob/master/lemmatization-en.txt
This much simpler text version will reduce the amount of
informational entropy (things that can go wrong in my
head when remembering).


This text is well-sanitized and ready to undergo indexing.

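
Applying a lemma table to the sanitized words is a simple lookup. The sketch below assumes the table has already been loaded into a `Map` from inflected form to base form. The four entries shown are made up for the example, and a real table would come from a lemmatization list such as the one in the footnote.

```javascript
// Illustrative lemma table: inflected form -> base form.
// These entries are hypothetical sample data.
const lemmas = new Map([
  ['complexity', 'complex'],
  ['reducing', 'reduce'],
  ['desirable', 'desire'],
  ['intellectual', 'intellect'],
]);

function lemmatize(words, lemmas) {
  // Words without an entry pass through unchanged.
  return words.map(word => lemmas.get(word) || word);
}

const words = ['reducing', 'complexity', 'always', 'desirable', 'thing'];
console.log(lemmatize(words, lemmas).join(' '));
// reduce complex always desire thing
```

The same function should be applied to the user's query before lookup, so that both sides of the match use base forms.
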
Conclusion 


The next morning, after writing the solution, I felt pleased
with myself for having a complete solution that I could
email to my primary customers (a few Data Science
managers throughout the organization). I was so pleased
that I showed it to a colleague who helped me optimize the
index.


Years later, it's still the way the organization tracks
available conda packages. 


I could only pull this off overnight because I have
implemented this so many times. I have used
browser-side indexes to implement dashboards for
regression testing, employee work allocation, personal
blog searches, and a Project Gutenberg search engine. It
even became the basis for a class I taught on Introductory
JavaScript.


Since initially writing this, I have also discovered that the
lead developer of PouchDB has written a full-text search.
You should use his solution, which is based on the Lunr
engine.


PouchDB Quick Search, Nolan Lawson
https://github.com/pouchdb-community/pouchdb-quick-search

Lunr is a JavaScript implementation of a Solr-style search
language
https://github.com/olivernn/lunr.js
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Non-Optimal 
This is not an optimal solution.


In this case, every search engine user must download and
generate their own instance of the index. This requires the
client to download the entire dataset the first time
(increasing network traffic) for each browser they use.


In a case where I used this to build a regression testing
dashboard, the initial download and parse of the test
results took approximately 3 hours. Caching made it
almost instantaneous if you kept up to date each day, but
that first load was a big one.


One critical advantage of server-side processing is the 
ability to take the inbound data and generate the indexes 
once for all the customers. You could split the difference
by generating the index on the server and having the 
clients download that constructed index. This would split 
the processing load between a central processor for the 
central data and distributed processing for the individual 
searches. 


This hints at an interesting balance between shared and
distributed processing. Using databases that support the
Couch interchange protocol, each user can process
new data as they find it and share the results back to a
central pool used by the next person. This Lazy-Load form
of data processing has some theoretical advantages (in a
trusted environment). It would require very little central
processing (expensive) and instead rely on existing


CouchDB is the original Open Source database in the
family of products; however, there are several related and
compatible products: CouchBase, CouchDB, PouchDB,
Cloudant
https://en.wikipedia.org/wiki/apache_CouchDB
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workstations. At the same time, it would share the 
workload to ensure no one computer took the brunt of the 
whole calculation.


Lastly, if this solution intrigues you, you should look into 


the Lucene-based searches provided for CouchDB, which 
are included with all commercial offerings. 
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Fishalytics 


A (failed) experiment in data analysis as a 
behaviour modification tool 


• Inefficiencies in daily life can act as an inspiration
for social change.


• Simple record-keeping can evolve into a platform
with social and ecological implications.


• Good ideas need good marketing to be successful.
Social change requires social acceptance.


Catch more fish by cooperatively tracking and
sharing sport angling data (and improve
ecological management at the same time).
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Once upon a time, I gave up my 100-hour work weeks as a 
software developer, moved to the other side of Canada, 
and bought a small-holding farm. One of the first things I
did when I moved to the new region was to buy a fishing
license, and I was shocked when they handed me an 
official catch form with my license. 


In the province I had moved to, when you have a
recreational fishing license, you are expected to keep track 
of the number of fish you catch, the region you caught 
them in, and what species they are. 


For a year.


I was struck by how inefficient this was. It was doubtful
that I would remember every fish I caught for a year, and it
was also unlikely that I would remember where I had put
the fishing scorecard that I was required to send in at the
end of the year.


[Figure: a screenshot of catches logged over the years, primarily in
Canada, plus a private fishery in England that an acquaintance gave me
written permission to fish.]
This seemed like a simple data entry form project that I
might even be able to sell to the provincial government. At
its simplest, the idea was that I could write an application 
that allowed fishermen to enter their fishing license 
information and record their fish as they caught them. At 
the end of the year, the software could print the scorecard 
for them. 


Over the next 10 years, this tool became the focus of much 
of my internal pondering regarding system development
and the ability of software to affect social change. 


From Simple Record 
Keeping to Analysis 


While the software's initial concept was to be a 
record-keeping tool for myself, it was only a short time 
before I started to think bigger. 


1. If I built the tool, I could make it more widely
available.


2. If I wanted others to use the tool, I needed to offer
them more than just a fishing journal.


I had just hit on my first element of social design: you
must offer something to the individuals you want to
harvest data from. Like one of the major social platforms, I
could harvest user data and sell it, but to do that, I first
needed to offer something in return.


There were two value-added features I could think of right
away.


1. The social aspect. Keeping pictures and memories
of catching fish is fun. Sharing fishing tales in near
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real-time would be a fun way for fishermen to 
engage with friends and a great way to record them 
for future memory. 


2. Basic analysis. Knowing where you caught the last 
fish gives you some idea of where to catch future 
fish, and using aggregate data of many fishermen, I
could offer solid, unbiased advice on where to catch 
the next fish. 


That first idea that people could share was important;
Facebook was young enough that competitive services
were still reasonable in niche fields. Unfortunately, social
platform development was really outside my domain of
expertise. The second option spoke to my strengths, and it
occurred to me that if I could develop a successful
predictive product, partnering with a social partner could
be performed later. Another benefit to focusing on the
analysis was that I was the sole user at this point: building
a social platform around a single individual is almost
impossible; however, a single individual can gather
multiple data points for analysis.


The first focus was set: predict what leads to catching
more fish. 


Planning the Data Gathering 


In order to suggest to users what they should do to catch
more fish, I first had to think of the variables that would
impact the ability to catch fish. The best place to
determine this was at my local lake, staring out over the
water.


1. What about my fishing (behaviour) could affect my
catching fish (outcome)?
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2. Also, what could affect my ability to catch fish but 
was difficult to measure?


3. Lastly, what was easy to measure but probably had 
little to do with my catching fish (red herrings)? 


In brainstorming these things, I came up with several
things that greatly impacted my catch rate and were
reasonably easy to measure.


• Location
• Time of Day (solar declination)
• Temperature (air and water)
• Solar penetration (cloud cover)
• Covering vegetation
• Lure
• Time spent fishing
• Time of Year


Unfortunately, only two seemed easy enough for the
average fisherman (myself) to collect regularly.
Space-time is easily captured on a phone, so anything
involving those metrics is easy to capture: location, time
of day, time of year, and solar declination.


These two items can be measured easily through
smartphone GPS logging. By marking the start of a fishing
trip and having the application continuously log the
position of the person fishing, we can get a sense of how
long they stood in a given location with a line in the water
without catching fish. Upon catching a fish, it is natural to
want to capture the moment. Snapping a picture (again
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through the app) marks the moment and location at which 
the fish was actually caught. 


Measuring Success 


Putting these variables together into a meaningful metric 
became the next problem. 


We need a clear definition of success to determine which 
variables lead to success. To some extent, this requires 
distinguishing causal variables from outcome variables. To 
determine how a fisherman would consider themselves 
successful, there would be no better way than to interview 
a recreational fisherman. 


I went fishing.


You find gold where the gold is


— Prospector's Proverb 


Standing waist-deep in water at my local lake, with a line
in the water, I began pondering the variables that would
make me successful at that moment. As I stood there, I
realized it was catching a lot of fish; catching fish is
exciting, even small fish. So, there is an element of
quantity. The sheer amount of fish you catch is a positive
experience. But standing in a lake for 4 hours to catch two
fish is not the same as hitting 2 fish in 20 minutes: the
velocity at which we catch fish matters. At that moment, I
caught the most impressive Small-mouth Bass I have ever
caught. On a small fishing rod, a big fish is a fun
experience. Fishermen brag about that giant fish they
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caught. So it's not only a measure of the quantity of fish 
caught, but also the quality of those fish.


A fishing trip can be measured as being successful by the
velocity at which you catch fish.


$$\text{velocity} = \frac{\text{quantity of thing}}{\text{time}}$$


There is a little more to it, as we want to factor in the
quality of the fish caught. We can also consider the entire
time spent as a single fishing trip.


$$\text{velocity} = \frac{\text{number of fish} \times \text{quality of fish}}{\text{time spent standing next to water}}$$


The problem with this definition was that it needed to be
more granular. The end goal was to create a heat map
representing the best places to go fishing. This heat map
would represent a range from null (no information) to
good to bad. So, while a trip represents a range of
space-time (different fish caught at different locations
and times), I required highly granular data that specified a
point in space-time.


A hypothetical fishing trip with two fish caught. The fish took a certain
amount of time to catch, which represents an effort on my part.


Here, we can take our cue from accounting. In addition to 
the time when the success was achieved, we can measure 
the time between as the cost of catching the fish. 


In the example above, the Small Mouth only took 20
minutes to catch. However, I continued fishing without
seeing another for 50 minutes. We, therefore, allocate the
unsuccessful time to the nearest fish caught.


Given this new perspective, we can change the measure of 
success to 


$$\text{score} = \frac{\text{quality of the fish}}{\text{time spent fishing for that fish}}$$


The trip score can then be considered the average of all
the individual fish scores.
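
The allocation of fishless time can be sketched in code. This is one possible reading, not the original implementation: "nearest fish" is interpreted as nearest in time, so each fish owns the minutes back to the midpoint with the previous catch and forward to the midpoint with the next one. The function name `scoreTrip`, the record fields, and the quality value of 489 are placeholders.

```javascript
// Charge every minute of a trip to the nearest catch in time, then
// score each fish as quality / minutes. Times are minutes from the
// start of the trip.
function scoreTrip(tripStart, tripEnd, catches) {
  const ordered = [...catches].sort((a, b) => a.time - b.time);
  const scored = ordered.map((fish, i) => {
    // Midpoint with the previous catch, or the trip start for the first fish.
    const from = i === 0 ? tripStart : (ordered[i - 1].time + fish.time) / 2;
    // Midpoint with the next catch, or the trip end for the last fish.
    const to = i === ordered.length - 1 ? tripEnd : (fish.time + ordered[i + 1].time) / 2;
    const minutes = to - from;
    return { ...fish, minutes, score: fish.quality / minutes };
  });
  // The trip score is the average of the individual fish scores.
  const tripScore = scored.length === 0
    ? 0
    : scored.reduce((sum, f) => sum + f.score, 0) / scored.length;
  return { scored, tripScore };
}

// One fish caught 20 minutes into a 70-minute trip: the 50 fishless
// minutes that follow are charged to that fish as well.
const trip = scoreTrip(0, 70, [{ species: 'Smallmouth Bass', time: 20, quality: 489 }]);
console.log(trip.scored[0].minutes); // 70
```

With two or more catches, the midpoint rule splits the quiet time between neighbouring fish rather than charging all of it to one.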


We are getting closer to a simple score with a clear
definition of how to measure the time per fish.
Unfortunately, the definition of a fish's quality is not
straightforward.


Generally, size is considered the measure of a successful
catch, but not all fish are considered equal. If I am fishing
in a mountain stream and catch a good-sized trout, it will
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be a very different size from even a small Great White 
Shark. Age is also a factor: young fish will be smaller than 
older fish.


Another wrinkle enters when we consider why the
government was collecting this data: fisheries
management. They want to know the health of the
regional ecosystem. Under these conditions, it is not
sufficient to know that the fish is bigger; we need to know
whether it is an appropriate weight for its age. High or low
values could indicate various stresses on the population.


[Figure: Standard Weight (Ws) for Largemouth Bass and Burbot, plotting
Weight (g) against Total Length (mm).]
Standard Weights for two different species. As length increases so
does weight, but at different rates (Wikipedia, CC BY-SA 3.0)


A species' Standard Weight is a measure of the average
weight of a fish given its length. This is basically BMI for fish:
given a fish's length, we can consider its normal weight.
This weight follows a power curve (fish get fatter
faster than they get longer) and is unique to each species,
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with each species having two constants that define their 
normal curve. 


This does mean the parameters for each species are required.
Fortunately, that database exists in FishBase, an online
catalogue of fish research worldwide.


To identify a fish's parameters, all that is required is to know the
fish species (a common name is acceptable) and the
location where it was caught. This results in a page about a
fish, including its Standard Weight. For example, our
Small Mouth Bass: (a) 0.0129, (b) 3.06, (len) 8" or
216 mm, (weight) 1/3 lbs or 153 g.


[Figure: FishBase Bayesian length-weight analysis for the species.]
The mean a and b values are offered in the footnote.


Given this information, we can consider the quality of a
fish to be its variance from its Standard Weight. Note that
for convenience, scores are shifted to a positive range
(catching a fish is always a good thing) between 0 and
1000 (because per mille has always amused me).


FishBase is a global biodiversity information system on
finfishes
https://fishbase.se/home.htm

Smallmouth Bass Length Weight sampling from
FishBase
https://fishbase.se/popdyn/LWRelationshipList.php?ID=3382
stdWeight = StdWt(a, b, mm)
          = StdWt(0.0129, 3.0600, 216)
          = 156.39


quality = (weight - stdWeight) / stdWeight
        = (153.09 - 156.39) / 156.39
        = -0.021113249762


For convenience, convert to per mille with
500‰ being the center point.


floor((-0.021113249762 / 2 + 0.5) × 1000)
= 489‰


This should be further converted to the "score" by
integrating the time spent fishing for the fish:


score = quality / time
      = 489 / 70
      ≈ 7


Using the standardized quality method gives us a measure 
of the quality of each fish caught, allowing us to produce 
aggregate values (such as the trip score) without concern 
for species variability.
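
The whole calculation fits in a few small functions. The sketch below assumes the FishBase convention that a and b are quoted for length in centimetres and weight in grams, so the millimetre length is converted first. The function names and the clamping to the 0 to 1000 range are assumptions for illustration. Results may differ slightly from the worked figures in the text depending on how the inputs were rounded.

```javascript
// Standard weight from the length-weight relationship W = a * L^b.
// Assumption: a and b expect length in centimetres, weight in grams.
function stdWeight(a, b, lengthMm) {
  return a * Math.pow(lengthMm / 10, b);
}

// Relative deviation from the standard weight: 0 means exactly typical.
function quality(weight, standard) {
  return (weight - standard) / standard;
}

// Shift the deviation onto a 0-1000 scale with 500 as the centre.
// Clamping to the range is an added safeguard, not from the text.
function perMille(q) {
  const shifted = Math.floor((q / 2 + 0.5) * 1000);
  return Math.min(1000, Math.max(0, shifted));
}

const standard = stdWeight(0.0129, 3.06, 216);
const fishQuality = perMille(quality(153, standard));
const score = fishQuality / 70; // 70 minutes spent on this fish

console.log(standard.toFixed(1), fishQuality, Math.round(score));
// 156.3 489 7
```

Because the quality is relative to each species' own curve, scores for a trout and a bass can be averaged into one trip score without the larger species dominating.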
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Analytics 


By querying a space-time bounding box, a user can view
places where fishing has been particularly good or bad.
This was charted using OpenStreetMap, Leaflet.js, and
Heatmap.js. It is useful not only for fishermen but also for


ecologists looking to study the quality of the fish in the
area.


The default example from Heatmap.js (demo)


With sufficient observations, this standardization of the 
data allows for several other types of analysis.


• Year-over-year analysis is possible, allowing
ecologists to monitor for trends of declining or
recovering populations.


• Species comparative analysis can show one species
filling in for another species, a common symptom
of an environment in distress.


• Time of day analysis, or seasonality, can improve
catch rates by fishermen.


• Anonymously reported aggregated data encourages
self-reporting of poaching.


This finally made me realize there were two additional
target audiences for this data. 


Fisheries Management 


The ability to turn fishermen's stories into meaningful
analytics meant fisheries didn't need to wait until the end
of the season to get paper records. Real-time catch data
collected by customers can be used to gain insight into the
health of bodies of water. This means early interventions
can be taken.


The Ecosystem 


Fish populations are early warning signs of ecological 
disasters. Changes in the types and sizes of fish 
populations are a good indicator of the water's 
environmental health. These analytics provide real-time 
and early detection of ecological issues. 


Fishermen don't want to fish in an overfished area, and
ecologists don't want fishing in stressed areas. A system 
like this can help identify healthy fish populations and 
direct fishermen toward those, leaving stressed 
populations alone to recover. 


Conclusion 


Fishalytics was a failed social experiment for me. After
nearly a decade, I abandoned the project without it ever
moving past a personal fishing journal. Competing
priorities (I was a farm labourer), legal obligations (I
didn't want to lose intellectual property while working for
some companies), and theft (a customer I pitched this to
pitched it back to my students four years later) left me just
not working on it and finally letting it go.


But it wasn't a waste of time either. It has offered me my
first real introduction to the possibility of using Data
Analysis as a tool for social change. Also, I started to
conceive of how passive pressure could enact social
change.


It also introduced me to the idea that everything can, and
often should, be boiled down to a single health metric.
This has been beneficial to me in other data analysis roles,
where being able to identify variance in an abstract health
metric has allowed for early intervention.


While I failed to achieve the desired results, I hope this
article helps others see software and systems as more than
forms on a page. Rather than simple forms, view them as
tools for evolving social systems and modifying behaviour
for the betterment of all.


Further Reading 


There are a few libraries that are highly useful for this type
of project.


Leaflet.js: a mapping library for interacting in
the browser


Plotly: a general charting library. Not used in
the project, but a staple in things I do now


Heatmap.js: the library used for integrating
heatmaps in Fishalytics

Also, if you want to enact social change through software
development, there is a classic piece of work you must
read.


Wicked Problems: Problems Worth Solving: how do you
implement social change in complex systems?


Finally, while I have suggested valuable tools, with great
power comes great responsibility. I, therefore, leave you
with this warning from Charles Goodhart: Any observed
statistical regularity will tend to collapse once pressure is
placed upon it for control purposes, or more simply put


When a measure becomes a target, it ceases to be a 
good measure 


— Goodhart's Law 




Technologic 
(In)accessibility 


Technology as a barrier to the beach


• Making assumptions about customer capability can
create barriers to products.


• Technology itself acts as a barrier to entry for
marginalized communities.


• An analysis of the societal impact of technological
accessibility issues emphasizes the importance of
considering diverse user experiences in system
design.


I went to the beach a couple of weeks ago.


Beaches are hard to come by on the bald Canadian prairies,
so this was a rare treat for my wife and me. Having
recently moved from a coastal city, we were looking
forward to wading into some water and introducing our
young dog to swimming. We had just learned about Sandy
Point Beach in Lacombe County, were very excited and
more than willing to spend a couple of hours in the car.


Upon arriving, an older gentleman wearing a security
uniform informed us that parking had recently become
paid parking (no problem) and that we just had to scan a
QR code on a pamphlet he gave us (big problem).


I asked if the facilities offered free WiFi to connect to the
payment service. I was assured there was plenty of service
just up ahead. I tried to be specific that I do not have a data
plan on my phone, but he did not seem to understand. I
rolled the dice, proceeded in, found a parking spot, and
attempted to pay: no WiFi. It did occur to me that it was
closer to the change building, so I walked over to the
building and tried there: no WiFi.


This was going to be a problem.


Upon returning to my car, I found the security guy had
already taken my plate number. I did ask him what
alternate options were available for me to pay. I did try to
explain that I did not have any means to connect to the
internet, but that only resulted in him getting frustrated
with me and simply stating, "you just open your phone, and
you get the internet."


Unfortunately, this isn't true for everyone, and (as is often
the case) assumptions regarding people's capabilities
result in barriers to accessibility. These assumptions
regarding customer abilities lead to wheelchair ramps not
being considered, hazards not being marked for the
visually impaired, and audio tutorials being offered to
deaf people without an alternative.


To some extent, assumptions are inevitable: not having
lived a particular experience, it is natural to be unaware of
the nuances involved in that experience. Fortunately,
society is (mostly) aware that we are sometimes unaware


Importance of seeking expert advice


* Barrier-free design guide, Fifth edition, Alberta
Government. Guides such as this represent the distilled
knowledge of experts and are a good place to start.
https://open.alberta.ca/publications/barrier-free-design-guide-fifth-edition


Unfortunately, another set of barriers can be described as
Technological Barriers, which are often not even
considered. Due to marketing and the increased cost of
implementation, technology is sold as simple to
implement, which unfortunately overlooks some of the
complex nuances of how humans interact with technology.
We must be thoughtful in our design of technological and
physical space.


Thoughtful design means to consider how a space
is to be used ... The objective is to remove as many
barriers as possible.


~ Alberta Government, Barrier-Free Design
Guidelines (p. 2)


Accessibility of Technology 


As users progress from discovering the service to seeking
it out, many things can prevent them from engaging. We
all know that technology significantly reduces these
barriers: digitally readable text increases the options for
consuming text, and networking allows communication to
reach the consumer rather than the consumer coming to
the message.


By diversifying our modes of communication, we create
redundancies and alternate paths for our consumers to
follow; these alternate paths allow those with various
barriers to seek an alternate route to the same outcome.


One of the risks of easy and low-cost solutions is the
temptation to use them to the exclusion of all else. This
leaves no means for those with accessibility issues to
bypass the barriers in their way ... and without personal
experience, we will likely be unaware that those barriers
exist. We need to rely on experts with domain knowledge
to avoid making assumptions based on our own expertise.


In any customer's progression toward success, several barriers slowly
whittle away at those who can enjoy the product or service.


In this case, the assumption is that everyone has access to
mobile devices and mobile data plans through one of the
major Canadian providers. Unfortunately, this isn't true
for 24% of Canadians, who, in a 2019 study, did not have
access to a Smart Device. Many more do not pay for the
internet to be accessible from their devices.


* Wait: 25% of Canadians still don't have ANY kind of
mobile phone?, A Journal of Musical Things
https://www.ajournalofmusicalthings.com/wait-25-of-canadians-still-dont-have-any-kind-of-mobile-phone/


This has changed during pandemic lockdowns, but there is ...
that 20% of Canadians do not have data plans on their
mobiles as of 2021. This is even more pronounced among
rural Canadians, at 27%, the very customers the rural ...


Smartphone Ownership: The Mobile Disconnect


Cost of Technology 


In the mid-2000s, the trend was toward shared WiFi and
WiFi networks. This was a low-cost and ubiquitous
solution to internet connectivity in urban areas (I loved my
Nokia N800). As WiFi was ubiquitous in coffee shops and
offices, many people never felt the need to purchase data
plans.


* Smartphone Ownership: The Mobile Disconnect, Statista
https://www.statista.com/chart/16937/share-of-adults-who-own-no-mobile-phone-or-have-a-non-smartphone


This is significant because, apparently, Canadians pay a
lot for their mobile data, among the World's most
expensive rates. In Canada, it is not unreasonable for a
couple to pay $1800-$2500/year for internet connectivity
on their phones, which cost $720-$1200/year.


If the household income is
about $82,000/year, a total data cost of $3700/year
represents a significant proportion (4.5%). Considering
the considerable acceleration of inflation in Canada,
leading to an 8.2% inflation rate in 2020, one can foresee
that Canadians will be seeking means to reduce their
household expenses, and a 4.5% budget item with an easy
workaround (public WiFi) is an obvious candidate for cost
reduction.


* Do Canadians pay too much for internet and cellphone
service?, CBC News
https://www.cbc.ca/player/play/video/1.6522869

* 'Worst in the world': Here are all the rankings in which
Canada is now last, National Post, 2022-08-11
https://nationalpost.com/news/canada/worst-in-the-world-here-are-all-the-rankings-in-which-canada-is-now-last

* Telus and Shaw Mobile Phone Plans, 2020
https://www.telus.com/en/mobility/plans
https://www.shaw.ca/internet/plans/

* Household income statistics by household type: Canada,
provinces and territories, census divisions and census
subdivisions
https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810005701&pickMembers%5B0%5D=1.4275&pickMembers%5B1%5D=2.4

* Inflation rate will remain 'painfully high' all year, Bank
of Canada governor anticipates, CTV News, 2020-07-21
https://www.ctvnews.ca/politics/inflation-rate-will-remain-painfully-high-all-year-bank-of-canada-governor-anticipates-1.5995379


Canada is currently experiencing unprecedented inflation, which
has been accelerating.


and that is not even the group we are concerned with. 
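

The 4.5% budget share quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the $3700 total is the sum of the two upper bounds given in the text ($2500 for connectivity and $1200 for the phones), and the function name is made up for illustration.

```python
def budget_share(annual_cost: float, annual_income: float) -> float:
    """Return annual_cost as a percentage of annual_income."""
    return 100.0 * annual_cost / annual_income


# Assumption: the total is the upper connectivity figure plus the
# upper phone figure quoted in the text.
total_cost = 2500 + 1200
household_income = 82_000

share = budget_share(total_cost, household_income)
print(f"${total_cost}/year is {share:.1f}% of a ${household_income} income")
```

The same function makes it easy to see how quickly the share grows as income falls, which is the point of the paragraphs that follow.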


When designing a system with an eye to accessibility, it is
essential to consider those outside the norm. Having an
average or above-average income represents a privileged
group of decision-making individuals. It's easy to forget
that about 16% of the population is financially vulnerable.


* Inflation Control Target, Bank of Canada, 2020. Red
annotations added for emphasis
https://www.bankofcanada.ca/rates/indicators/key-variables/inflation-control-target/


In the case of governmental resource management, it is
important to consider the less advantaged and vulnerable
populations of the community. A day at the beach
represents an attractive, cost-effective activity for
low-income and vulnerable individuals (retired
pensioners, single parents, or those who have just fallen
on hard times). Socially, these people are best served by
having access to public resources. At the same time, they
are also the most vulnerable to having to make hard
budget decisions.


Table Topping


One approach is to
walk through the proposed system to identify potential
vulnerabilities. While this walk-through should be done in
the actual environment, it is worth doing as a tabletop
exercise in its early stages.


By designing and testing the system by visually describing
the anticipated process, you can map process points
subject to issues and barriers. This is similar to risk
mitigation in project planning but differs in that it
requires a system to have been designed first. You are
testing a design to ensure you have handled all cases
rather than trying to plan for all cases. This should also be
an iterative process: design the system, identify flaws,
propose change, and repeat.


* "Tabletop exercises explained: Definition, examples,
and objectives", Josh Fruhlinger, 2024-04-02
https://www.csoonline.com/article/570871/tabletop-exercises-explained-definition-examples-and-objectives.html


A quick mind-map addressing potential negative outcomes that should
be addressed


These issues and barriers can be documented and
considered for the probability of occurrence, impact
significance, and mitigation plan. Only some things need
to be handled, but all of them should be acknowledged.


Scenario              | Issue       | Prob | Sev  | Mitigation
ParkPayOnsite-Website | No Data     | 20%  | High | Kiosk
ParkPayOnsite-Kiosk   | Not Exist   | 20%  | High |
ParkPayOnsite-Website | Bad Network | 1%   | High | Free Day
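

A register like the one above is easy to keep as data, which makes the unaddressed items hard to overlook. The sketch below is illustrative only: the numeric severity weights and the probability-times-severity score are assumptions added here, not part of the original exercise.

```python
from dataclasses import dataclass

# Assumed numeric weights for the severity labels (hypothetical scale).
SEVERITY_WEIGHT = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Risk:
    scenario: str
    issue: str
    probability: float  # 0.0 to 1.0
    severity: str
    mitigation: str = ""

    def score(self) -> float:
        """Simple ranking score: probability times severity weight."""
        return self.probability * SEVERITY_WEIGHT[self.severity]


risks = [
    Risk("ParkPayOnsite-Website", "No Data", 0.20, "High", "Kiosk"),
    Risk("ParkPayOnsite-Kiosk", "Not Exist", 0.20, "High"),
    Risk("ParkPayOnsite-Website", "Bad Network", 0.01, "High", "Free Day"),
]


def unmitigated(items: list[Risk]) -> list[Risk]:
    """Return the risks that have been acknowledged but not yet handled."""
    return [r for r in items if not r.mitigation]


ranked = sorted(risks, key=lambda r: r.score(), reverse=True)
for r in ranked:
    flag = "" if r.mitigation else "  <-- no mitigation recorded"
    print(f"{r.score():.2f}  {r.scenario}: {r.issue}{flag}")
```

Here the missing kiosk is flagged because its mitigation column is empty, mirroring the blank cell in the table.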


It's very common in this process to have our bias show
through and to be dismissive of an issue that is difficult or
uncomfortable to address or sometimes challenges our
worldview (e.g., I have the internet on my phone, everyone I
know has internet on their phone, therefore everyone has
internet on their phone). This is most dangerous at
executive levels, as off-hand remarks may communicate
decisions and desires to analysts and designers (having
opinions is normal, but voicing them can be dangerous).


Conclusion 


Having walked through the case, we can see how
important it is that organizations approach their system
design with an eye to accessibility and take active
measures to prevent analyst and executive privilege from
creating a bias that excludes the vulnerable. We can also
see some tools used to mitigate and manage these issues.


Unfortunately, this particular technology implementation
to support parking payments at Sandy Point Beach has
made the beach inaccessible to many people. While I
understand and agree that the county should be charging
some fee to offset the creation of an artificial beach in the
middle of The Prairies (I saw the mountain of sand off to
the side), having no onsite means of payment available to
patrons (WiFi, cash kiosk, or digital kiosk) means that
what should be an affordable and accessible form of
entertainment for residents and tourists has an
insurmountable technological and financial barrier for
many.


Lacombe County indicates that this trial run focused on
education, so no tickets were issued. They also note that
payment can be made in advance at the County Office or
online. However, neither option is mentioned on their
website. Perhaps this is an exercise in table-topping, and
they have taken the time to consult the appropriate
experts.


* Thomas Becket was killed by King Henry's knights after the
king, in a moment's rage, said, "What miserable drones and
traitors have I nurtured and promoted in my household
who let their lord be treated with such shameful contempt
by a low-born clerk". That off-handed comment sparked
a murder and a revolt, and nearly cost him the throne.
https://www.britishmuseum.org/blog/thomas-becket-murder-shook-middle-ages


Unfortunately, this demonstrates how we (as
decision-makers) can fail to identify these issues as we
rely on our own life experiences and fail to be aware of the
diverse experiences of others. We can fail to address the
problems in advance or create viable alternative success
paths.


In the end, it was a frustrating start to what was supposed
to be an exciting day for my wife, myself, and my dog. It
has turned a public resource into one that is technologically
accessible only to the privileged.


Healthcare professionals say that dangerous heat
puts marginalised and vulnerable communities at
risk because low income populations have a more
difficult time accessing cooler spaces and
green-spaces


— Millions of Canadians try to stay cool during heat
wave, CBC, 2022-08-12


Paper as a Digital
Storage Medium


Distributing Data in the Present and 
Preserving Data for the Future 


• Describes the concept of utilizing paper as a digital
storage medium.


• Combine the reproducibility of digital data with the
long-term storage capabilities of paper.


• Explore the challenges of modern information
distribution and preservation, including the risks
of centralized data storage and the loss of historic
copies.


• As an experiment in using barcodes as a storage medium,
create an EPUB reader that stores its data on paper.


Two Stories


A story about anonymity


In recent times, we have seen a war of information. In
Russia, news sources are being denied for criticizing
their invasion of Ukraine. In China, online speech is
monitored and can result in punitive damages for
individuals. Saudi Arabia asks neighbours to denounce
each other.


My grandparents migrated from Europe to North America.
Eastern Europe had troubles after the war, and as with
so many other refugees, everything my grandparents
possessed had been lost, so a move to a new land filled
with opportunities captured their imagination. I grew up
on stories my Grandmother and various Aunts passed
down to me, which inspired me to read more.


One narrative that always struck me was the burning of
forbidden books. We have seen this in several countries
over the centuries as specific belief systems are
suppressed. Still, the most famous was the raid on the
Institute for Sexology in Berlin in 1933.


* "More Russian media outlets close as Moscow cracks
down", Anna Cooban, CNN Business, 2022-03-04
https://www.cnn.com/2022/03/04/media/russia-media-crack-down/index.html

* "Hong Kong's Crackdown on Dissent Hits Facebook
Pages", Newley Purnell, Wall Street Journal, 2022-08-16
https://www.wsj.com/articles/hong-kongs-crackdown-on-dissent-hits-facebook-pages-11660645401


The Institut für Sexualwissenschaft was the leading
organization dedicated to the study and advocacy of
alternate sexuality in Europe. On May 6th, Government
Officials raided the facility. Much of the early research
(and advocacy) of gender studies was dragged out into the
streets and dramatically destroyed for being un-German.


When people first learn of the story, they are rightly
distressed about the knowledge that was forever lost. Still,
a lesson is learned from the later stories of lost treasure
troves being recovered from someone's basement after the
war. Here is how the story goes in my head.


Magnus Hirschfeld publishes a great work and gives
all his students copies. The professor and his
students are arrested and executed, and their
personal libraries are looted and destroyed.
Fortunately, Li Shiu Tong, one of his students, had
lent the book to an acquaintance. The acquaintance
was sympathetic to the NSDAP authorities but did
not want to cause trouble for his friend. He had put
it on his bookshelf and forgotten about it. Years
later, when he died, his wife put all the books into
boxes and stored them in the attic, where they
stayed for the next 50 years because nobody was
looking for a forgotten book in a forgotten
collection.


In my internal narrative, this happens on the East German
side, where Stalin continued to suppress homosexuality.
The book is completely lost, except for that one accident in
which it was put away in a box and forgotten about. It has a
chance at a new life when society is ready for change.


The ability to be forgotten and anonymous is significant in
disseminating dissenting opinions.


In the modern era, as information delivery systems have
become more robust, we see the same destruction of
knowledge taking place, though more subtly. As the
distribution cost has been reduced, we have seen data
become centralized: it is much easier to visit Wikipedia on
your phone than to download the page and carry it around.
Also, Wikipedia has an open edit history associated with
the documents; not all websites are so open.


This leads to two risks:


1. There is the risk of the lone copy in a single
organization's archive and content being removed
from the library (webserver). In the example above,
Hirschfeld indicated that his library should be
donated to the University if the Institute is closed.
This never happened; the forced closure was
deemed legal, and all copies were destroyed.
2. This centralization means that content can be
edited without maintaining historical copies. Since
the edit history is lost, we can never track significant
shifts of opinion. History can be changed.

The Internet Archive demonstrates the need for this:
websites and content are regularly removed from the
Internet for reasons as innocuous as cost (part of the
reason Git was developed was to protect OSS from being
lost to public servers being shut down) and as nefarious as
governments shutting down news stations to silence
dissent. Central repositories like the Internet Archive help
to protect knowledge by allowing us to observe changes,
but they also put knowledge at risk by being the only
keepers of history.


Distributing the data across many bookshelves protects it 
from complete loss. 


A story about storage


Many years ago, I heard a story. I don't know if it is true,
but it carries a valuable lesson.


* Building Democracy's Library, Internet Archive Blogs,
Chris Freeland, 2022-09-06
https://blog.archive.org/2022/09/06/building-democracys-library-celebrate-with-the-internet-archive-on-october-19/

* An Egyptian Perspective on American Book Banning,
Hassan Said, Internet Archive Blogs, 2022-03-10
https://blog.archive.org/2022/03/10/guest-blog-an-egyptian-perspective-on-american-book-banning/

* More Russian media outlets close as Moscow cracks
down, Anna Cooban, CNN, 2022-03-04
https://www.cnn.com/2022/03/04/media/russia-media-crack-down/index.html


In the early '90s, an amazing product became accessible
that allowed people to generate a lot more data than they
ever had and of a higher quality than ever before:
Microsoft Word. What had previously been stored on paper
could now be digitally encoded and stored on disk. The
archivists loved it; they were stuffing data onto disks left,
right, and centre.


In the late '90s, Microsoft upgraded Word.
Into an incompatible format.


There was no way to recover all that long-term stored
data. Legally, they were not allowed to convert it, as it had to be
stored exactly as it was placed into storage (and signed off
on).


In another twist, magnetic storage degrades over time and
tolerates only a limited range of environmental conditions. It's very
easy to damage the storage medium.


In the story I was told, Archivists at the US Congressional
Library said, "You know what doesn't degrade? Paper."
They promptly started printing everything to paper,
bundling it, and storing it in the existing vaults.


What if there was a way to have the best of both worlds?
What if it is possible to have the fidelity of digital storage
with the lifespan of paper, the volume of transmission
available in Smart Devices, and the anonymity of
in-person conversation?


Unfortunately, much of the data produced now is dynamic.
By dynamic, I mean that you can interact with the visualization itself
(scroll through a map, rotate a 3D model, filter, search,
and aggregate massive datasets), but once it has been
printed to paper, that is no longer possible.
Also, it takes a lot of work to transfer large data tables
from paper to digital media. Scanning the documents as
images and using OCR to collect tables of information
loses significant amounts of metadata:


• Data Types must be guessed from the content


• Alignment issues cause data to be considered out of
context


• Character fidelity can cause incorrect values to be
interpreted


While high-resolution photography and Artificial
Intelligence have certainly improved the quality of
scanned content, the data is still transferred
through an analogue step, which can result in mistakes.
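

To make the metadata loss concrete, here is a minimal sketch of the kind of type guessing a table importer has to do with OCR output. The function is hypothetical and deliberately naive; it only illustrates the first and third bullets above.

```python
def guess_type(cell: str):
    """Naively infer a value from OCR text, as a table importer might."""
    text = cell.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


# An identifier is read as an integer, so its leading zeros are lost.
print(repr(guess_type("00123")))

# The letter O misread in place of a zero silently turns a number
# into a string.
print(repr(guess_type("1O5")))

# A price becomes a float, and the trailing zero is no longer recorded.
print(repr(guess_type("3.50")))
```

None of these outcomes raises an error, which is what makes them dangerous: the importer produces a plausible table that no longer matches the printed page.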


Defining the Problem 


What if there was a way to compromise between the two
worlds: the long-term storage of paper, with the high
fidelity of digital; the anonymity of a private conversation,
and the distribution capacity of a computer network?


We are looking for a means to store digital information on
physical media such as paper or etched into stone. We
might call this visible media.


Properties of Digital 


Companies, governments, and individuals have a desire to
store data for long periods for legal archival purposes. This
is hard to do. Over the past 20 to 30 years, digital storage
costs have reduced as we moved from paper to magnetic
storage. This presents a problem for archivists who must
store the resulting volumes of data. As it becomes cheaper
for us to produce data, it becomes a greater challenge for
archivists to store it.


The data must have a simple interpretation: it must be
stored in a format easily converted to something a human
can read. Open Source standards are advantageous as they
are unencumbered by intellectual ownership and are
readily understood by a larger pool of experts.


Copying digital data is something we take for granted.
When we make a copy of digital data, it is an exact copy.
For example, music loses some fidelity when recorded in a
high-resolution format; however, the replication of the
song from that point forward retains an exact copy (at the
resolution of the bit).


Properties of the Storage Media 


While etching into stone or carving into wood are viable
options, the weight and volume of these media present a
barrier to storage. Linen and cotton
sheets represent lighter options but are expensive to
produce. Mylar and projector film reduce the size, which
offers good potential.


Modern archival paper represents a balance of
permanence, weight, and volume. Each of these could (and
should) be considered for various purposes; the solution
should be adaptable to all these media. We discuss
paper as the primary medium because paper has such a rich
evolutionary history as a storage medium.


A digital storage mechanism must offer a reasonable level
of compression. By compression, we refer to the number
of bits of information stored per square inch or pound.
This means it should be able to keep the record in a
physically small space, though this must be balanced with
the ability to read it back easily.
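

A rough feel for that density can be had with some back-of-envelope arithmetic. The sketch below relies on one published figure, that the largest QR code (version 40, lowest error correction level) holds 2,953 bytes in byte mode. The per-tile header overhead, the number of tiles per page, and the book size are assumptions chosen only for illustration.

```python
import math

QR_V40_L_BYTES = 2953   # byte-mode capacity of a version 40 QR code, level L
HEADER_BYTES = 200      # assumed per-tile metadata overhead (hypothetical)
TILES_PER_PAGE = 6      # assumed print layout (hypothetical)


def tiles_needed(file_bytes: int) -> int:
    """Number of barcode tiles needed to carry a file of the given size."""
    payload = QR_V40_L_BYTES - HEADER_BYTES
    return math.ceil(file_bytes / payload)


def pages_needed(file_bytes: int) -> int:
    """Number of printed pages needed at the assumed tiles-per-page layout."""
    return math.ceil(tiles_needed(file_bytes) / TILES_PER_PAGE)


book = 500 * 1024  # a 500 KiB ePUB, chosen only for illustration
print(tiles_needed(book), "tiles on", pages_needed(book), "pages")
```

Raising the error correction level or printing larger tiles makes the codes easier to read back but lowers the capacity per page, which is the balance described above.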


Solution 


Combining the needs of both these mediums allows us to
integrate existing technologies to create a unique solution.
ePUB is an Open-Source container format for Electronic
Books that offers a standardized and unencumbered
format for many data types. Further, 2D barcodes (in the
form of QR Codes) have become ubiquitous to transmit
URLs. However, fundamentally they are just binary
buffers capable of storing any encoded sequence of
numbers.


ePUB 


• Diverse data storage
• Compression
• Accessibility Conformance
• Widely Consumable


The transition from paper publishing to screen-based
mediums brought some transitional challenges. PDF was
popularized to digitize paper and act as an intermediary
between paper and digital formats. On the polar opposite
end of the spectrum from paper, digitized standards (such
as those developed by the W3C) have been optimized for
delivery to an unknown display.


* EPUB 3.3, W3C Recommendation
https://www.w3.org/TR/epub-33/

* ISO/IEC TS 30135-1:2014. Information Technology -
Digital publishing - EPUB3
https://www.iso.org/standard/53255.html


HTML introduced the idea of reformatting content to
adjust to meet the consumer's needs. This meant that the
text could be read by a screen reader, could reflow for
people reading on small screens, or could be made
larger for people with poor eyesight. This accessibility
of the format gave birth to a plethora of other standards
now managed by the W3C. These standards ensure
maximum availability to the greatest number of people.

ePUB takes advantage of these standards to encapsulate
websites into a single document. They embed webpages
into a ZIP file format to allow for the contained viewing of
the entire website. Generally, the documents are organized
into Chapters.
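

Because the container is a ZIP file, it can be inspected with ordinary tools. The sketch below builds a deliberately incomplete ePUB-like archive in memory and lists its contents. The helper name and file layout are illustrative assumptions, and a real ePUB also needs a META-INF/container.xml and a package document, which are left out here for brevity.

```python
import io
import zipfile


def build_minimal_container(chapters: dict[str, str]) -> bytes:
    """Pack chapter files into an ePUB-style ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The container format expects the mimetype entry to come first
        # and to be stored without compression.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        for name, html in chapters.items():
            archive.writestr(f"OEBPS/{name}", html,
                             compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


data = build_minimal_container(
    {"chapter1.xhtml": "<html><body><p>Hello</p></body></html>"}
)

with zipfile.ZipFile(io.BytesIO(data)) as archive:
    print(archive.namelist())
    print(archive.read("mimetype").decode())
```

The result is an ordinary byte buffer, which is what matters for the rest of this chapter: anything that can store bytes can store the book.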


Anyone can read and decode digital documents using the
common ePUB format. ePUB v3 allows for JavaScript to be
embedded, meaning you could embed maps, interactive
diagrams, etc. (like Shiny, but self-contained). As a
general W3C container, it is possible to embed other file
formats for consumption and preservation: datasets as
CSV or evidence in the form of video.


Barcodes 
You can encode digital information into barcodes, which
can then be printed to paper for long-term archiving, and
the barcodes can be read back to a digital device for
reading.
2-dimensional barcodes have been used for decades to
encode specialized information. BRML, text, or other
data formats have been appended to printed documents,
such as Driver's Licenses and invoices, to supplement the
text with digital information. This usually amounts to a
unique document identifier or a digital record.


Encoding an ePUB should be straightforward, though there are
several issues:


1. The encoding scheme must be identifiable by a
reader (there must be sufficient information
embedded in the data to allow a reader to
reconstruct the correct form)


2. The size of a single book will likely exceed a given
2D barcode's storage capacity. An encoding
mechanism will have to be able to span multiple
image tiles.


3. A social issue must be managed because humans
cannot read the codes directly. It is possible that
they do not wish to view the material for legal,
religious, or moral reasons. There must be
sufficient metadata to allow the viewer to decide
not to accept the message.


Once identified, the issues are easily overcome; adding
metadata to the individual tiles (application
identifier, pagination, title, author, and subject) should
offer sufficient information to allow users to interact with
individual tiles and reconstruct the data.


* Business Rules Markup Language (BRML), Cover Pages
Technology Reports, 2002-11-05
https://xml.coverpages.org/brml.html


A prototype of the concept has been created to
demonstrate the capability. The prototype's protocol
consists of:


• A URL: which points to the reader for either online
use (browser only) or installation as a PWA, or just
as a unique identifier that this is a compatible
format.


• A Protocol Version: as changes are made, the
correct decoder must be used.


• Pagination: the current tile number and the total
number of tiles to be converted. This allows for
correct sequencing as well as a measure of
progress.


• Bibliographic: Title, Author, and subject allow
readers to decide if this content interests them or is
legal for them to interact with. Filters can be added
to prevent accidental downloads from taking up
space.


• Parental Rating: not so much for parents, but
generally for people that are not interested in
certain types of content (filtering "xxx" content
from a work device, for example).


• Relevance Date: Some content is only valid up to a
certain point and should be ignored after that (e.g.,
a poster for a concert). Offer a hint to the reader
that perhaps this could be removed or ignored.


With this information in every tile, the read of the first
image can result in some information being given to the
user, allowing them to decide whether to continue or
block. If they choose to continue, pagination can be used to
determine the order in which the buffers should be
arranged for reconstruction.
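

The fields listed above can be sketched as a small header that travels with every tile. What follows is not the prototype's actual wire format (that is defined in its own specification). It is a hypothetical layout that uses a length-prefixed JSON header, with made-up field names, purely to show how pagination lets tiles scanned in any order be put back together.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TileHeader:
    url: str             # reader location / format identifier
    version: int         # protocol version
    index: int           # 1-based tile number
    total: int           # total number of tiles
    title: str
    author: str
    subject: str
    rating: str          # content rating
    relevant_until: str  # relevance date, empty if none


def split_into_tiles(data: bytes, chunk_size: int, **meta) -> list[bytes]:
    """Cut data into chunks and prefix each with a JSON header."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    tiles = []
    for number, chunk in enumerate(chunks, start=1):
        header = TileHeader(index=number, total=len(chunks), **meta)
        head = json.dumps(asdict(header)).encode("utf-8")
        # Layout: 2-byte big-endian header length, header, payload.
        tiles.append(len(head).to_bytes(2, "big") + head + chunk)
    return tiles


def read_tile(tile: bytes) -> tuple[TileHeader, bytes]:
    """Separate a tile into its header and payload."""
    size = int.from_bytes(tile[:2], "big")
    header = TileHeader(**json.loads(tile[2:2 + size]))
    return header, tile[2 + size:]


def reassemble(tiles: list[bytes]) -> bytes:
    """Order tiles by their pagination and join the payloads."""
    parsed = sorted((read_tile(t) for t in tiles), key=lambda p: p[0].index)
    if not parsed or len(parsed) != parsed[0][0].total:
        raise ValueError("missing tiles")
    return b"".join(payload for _, payload in parsed)


meta = dict(url="https://example.org/reader", version=1, title="Demo",
            author="Anonymous", subject="Test", rating="general",
            relevant_until="")
tiles = split_into_tiles(b"hello paper world", 5, **meta)
print(read_tile(tiles[0])[0].title, "-", len(tiles), "tiles")
print(reassemble(list(reversed(tiles))))
```

Because every tile repeats the bibliographic fields, a reader can show the title and rating after scanning any single tile, before deciding whether to collect the rest.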


The prototypic specification is available in more detail.


Various Uses


Secure Archives


Having access to an archive comes with permission issues.
Controlling access to information in archives that store
sensitive data can be difficult. Using this encoding
mechanism acts as an envelope around the content.


In the description of metadata, the content rating was
suggested. Reusing this portion of the protocol to carry
classification ratings would be very easy. Users offered
access to a secure document could have their specialized
reader check the content's classification rating before
decoding it. Suppose the individual only has sufficient
clearance to view some related documents, but some of
the papers in the area contain information that exceeds
the individual's current clearance. In that case, the rating
can act as a secondary filter on viewing them.


Obviously, this would be a tool to assist honest actors
within the environment and not a way to interfere with
malicious actors. Still, it is another layer of protection that
helps the actors manage the information they possess.
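

As a minimal sketch of that secondary filter, a reader could compare the rating carried in a tile's metadata against the clearance it has been configured with before decoding anything. The level names and their ordering below are assumptions made up for illustration.

```python
# Assumed classification levels, lowest first (hypothetical).
LEVELS = ["public", "restricted", "secret"]


def may_decode(tile_rating: str, reader_clearance: str) -> bool:
    """Return True when the reader's clearance covers the tile's rating."""
    return LEVELS.index(tile_rating) <= LEVELS.index(reader_clearance)


def viewable(documents: list[tuple[str, str]], clearance: str) -> list[str]:
    """Return the titles a reader with the given clearance should decode."""
    return [title for title, rating in documents
            if may_decode(rating, clearance)]


shelf = [
    ("Site map", "public"),
    ("Staff roster", "restricted"),
    ("Incident report", "secret"),
]
print(viewable(shelf, "restricted"))
```

As the text notes, this only helps honest actors: the payload is still printed on the page, so the check guides a cooperative reader rather than enforcing anything.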


* Barcode EPub specification - draft
https://gitlab.com/dpub/barcode-epub/-/wikis/Specs/Blocks


Information Dissemination 


Assuming you are in a place where information is
controlled, you could print essays and newsletters on
paper, which can be scanned for later reading. For
example, they could be printed in a pamphlet or posted on a
bulletin board, and nobody would know who published it
(beware of barcodes hidden on printouts).


One of the advantages, in this case, is the high ...


Poster: FREEDOM IS IN PERIL. DEFEND IT WITH ALL YOUR MIGHT.


Remote Media


Textbooks, posters, and advertising all have the common
element of having to display content in physically
contextual locations: a sign in a museum or a poster
stapled to a lamppost. Without access to network
communications, the audience misses out on an
opportunity.


Take, for example, a sign at the top of a mountain congratulating a mountain climber on their successful journey. A digital experience message could be left at the top, but it would require configuring and powering a WiFi-based website.


Alternatively, storing the immersive experience directly 
on the poster would allow the digital content to be
available but not require any power for maintenance. 


In such circumstances, etching the information into something more permanent, such as wood or stone, may be appropriate. The Judaculla Rock, which shows carvings from 2000 BC, demonstrates the staying power and low-maintenance capabilities of etching into stone.


Conclusion 


Anonymous and long-term storage of data and 
information is necessary. The free dissemination of ideas 
and their storage for future reference is fundamental for 
society's progress. While the digital age has made
information more accessible than ever, it has also 
introduced many new problems. 


Using paper as a digital storage medium is a novel and valuable approach to addressing some new problematic circumstances.


Further Reading


The Internet Archive, Digital Books wear out faster than Physical Books (November 15, 2022)


NYU Law, The Anti-Ownership Ebook Economy


How Publishers and Platforms Have Reshaped the Way We Read in the Digital Age


Storing Data in QR Codes


Encoding data on physical media for 
Dummy Programmers 


* Innovative concept: Explores encoding data on physical media, using QR codes.
* Step-by-step tutorial: Provides clear instructions and visuals for storing digital data as an image, accessible even to beginners.
* Further exploration: Hints at advanced techniques and practical applications, serving as a comprehensive guide for interested readers.


In the last chapter, we looked at the benefits of storing data on physical media, including long-term storage capacity, publisher anonymity, and distribution privacy. The question that was not answered was, how do you do it?


Computers are computers, and use disks and files... how would you manage to store digital information on paper!?


This idea first occurred to me while teaching introductory programming at my local community college. I was looking to find people interested in using computers for social justice and trying to find ways to change the world. The idea was to put posters around the school with an invitation to a Data Analysis club, but they were encoded so that only interested students would spot them.


While those days are behind me, passing secret messages 
just to those who have eyes to see still sounds like fun.


A simple demonstration is probably the easiest... it 
certainly is the most fun. I encourage you to play along. 


How do you store digital data as an image?


Remember when you were in grade school, and 
your teacher separated you from your friends so 
you couldn't talk to one another? You created a 
secret code with your friend and started passing 
notes.


We are going to create a secret message to pass on to a friend.


Pre-Requisites 


* A sample file
* A hex editor
* Some grid paper
* A pencil, eraser and scissors


Going through this exercise with a friend might be fun. It's like passing secret messages around the class in elementary school.


Inspect the file 


Open your sample file with a hex editor. You should see
something like this. 


[Figure: hex editor view of the start of the sample file. The text column shows PK, mimetype, application/epub+zip, META-INF/ and a second PK entry.]
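
If you do not have a hex editor handy, a similar view can be produced with a few lines of Python. This is only a sketch: the function name hexdump and the sample bytes are my own illustration, not part of any particular tool.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII, one row per `width` bytes."""
    pad = width * 3 - 1  # room for `width` two-digit values separated by spaces
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(rows)


# Illustrative bytes only: a zip signature followed by the EPub mimetype text.
sample = b"PK\x03\x04" + b"mimetype" + b"application/epub+zip"
print(hexdump(sample))
```

To look at a real file, read its first few bytes in binary mode (for example, open("sample.epub", "rb").read(64), where sample.epub is a hypothetical file name) and pass them to hexdump.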


Most people don't bother inspecting the actual contents of files (HINT: That's why people prefer data transfers as text). Still, you can get a lot of exciting information by bypassing the computer programs designed to use them.


For example, with a bit of know-how, we can immediately 
tell two things about the file we have open: 


1. It is probably a zip file

The first two bytes of the file are the values 50 and 4B (in hex notation). Interestingly, these values correspond to the ASCII characters PK. Many years ago, signing the start of your application's files became customary to tell your files apart from other formats. PK stands for PKZip by PKWare, the original company that created the file format.


2. It is an EPub file

Secondly, we can see that the mimetype is application/epub+zip. So, it's an EPub file (and confirmed as a zip).
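
The two observations above can be turned into a rough check. This is a minimal Python sketch; the function names and the 128-byte search window are assumptions of mine, and the test header is illustrative rather than a valid zip entry.

```python
def looks_like_zip(data: bytes) -> bool:
    """True when the data starts with the two signature bytes 0x50 0x4B ('PK')."""
    return data[:2] == b"PK"


def looks_like_epub(data: bytes, window: int = 128) -> bool:
    """Rough check: a zip signature plus the EPub mimetype text near the start."""
    return looks_like_zip(data) and b"application/epub+zip" in data[:window]


# Illustrative bytes only; a real zip header carries sizes and checksums here.
header = b"PK\x03\x04" + bytes(26) + b"mimetype" + b"application/epub+zip"
print(looks_like_zip(header), looks_like_epub(header))
```

A check like this only says "probably"; it is the same educated guess we made by eye in the hex editor.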


There is a lot of information at the binary level.


Serialise


The goal is to convert the file to a readable format. The easiest way to do this is to convert one byte at a time.

This has the advantage of doing it in order. Order matters, so by reading from the start to the end, we can ensure that the person we send the message to gets it correctly.

So let's read the first byte; it is the hexadecimal value 50.


Write that down on your grid paper (and maybe the next couple of values while we are at it).


Convert to Binary 
We are looking for a sequence of bits; each hex digit represents 4 bits (half a byte or a nibble). So, we need to convert each digit to its binary form.


Taking the first one:

5 (hexadecimal)
= 5 (decimal)
= 0101 (binary)


Don't be afraid to use your computer's calculator.
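
For those using the computer as their calculator, the digit-by-digit conversion can be sketched in Python. The function names here are mine and purely illustrative.

```python
def byte_to_nibbles(value: int) -> list[str]:
    """Split one byte into its two hex digits, each written as 4 binary digits."""
    high, low = value >> 4, value & 0x0F
    return [format(high, "04b"), format(low, "04b")]


def bytes_to_nibbles(data: bytes) -> list[str]:
    """Convert every byte in order, so the receiver can read the rows top to bottom."""
    nibbles = []
    for value in data:
        nibbles.extend(byte_to_nibbles(value))
    return nibbles


# The first two bytes of the sample file: 50 and 4B.
print(bytes_to_nibbles(bytes([0x50, 0x4B])))  # ['0101', '0000', '0100', '1011']
```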


Because this is a secret note, we need to remove our original working numbers. Grab your scissors and cut the first column off the paper.


(I'm switching my notation to ASCII art... for those who want to play along in a text editor)

_0101_ _0000_
_0000_ _0011_
_0100_
_1011_


Convert to Image 


A barcode is just an image that can be interpreted as numbers. The key for us is that we don't have to use the symbols 0 and 1; any two easily distinguishable symbols would work just fine.


This is similar to how Morse Code works, in which a binary sequence of characters is represented by different lengths of tones. What is used doesn't matter as long as the two things are distinguishable.


One really good pair of symbols, easily distinguishable by a computer with a camera, would be light and dark. This is convenient because colour can easily be printed on paper. We can use "the absence of pigment" (light) to represent 0 and "the presence of pigment" (dark) to represent 1.


Remember how I said to use pencil?

1. Take your eraser, and erase every 0
2. Take your pencil, and colour every 1
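
The eraser-and-pencil step can be mimicked in a text editor. In this Python sketch, the choice of "#" for dark and a space for light is an assumption of mine; any two distinguishable symbols would do.

```python
def render_row(bits: str, dark: str = "#", light: str = " ") -> str:
    """Replace each '1' with a dark cell and each '0' with a light cell."""
    return "".join(dark if bit == "1" else light for bit in bits)


rows = ["0101", "0000", "0100", "1011"]
for row in rows:
    # The brackets are only there so the all-light row stays visible on screen.
    print("[" + render_row(row) + "]")
```

Notice that the second row comes out completely blank, which is exactly the problem the next section deals with.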


My Eyes Are Buggy 


This is coming along nicely. We now have a series of binary digits encoded as bars of colour. This is also known as a barcode.


There's still one problem. I'm getting old.


My eyes aren't what they used to be.


It's hard for me to follow where the lines start and stop.


This is especially problematic on lines with nothing in them at all. The number zero (line number 2) has no mechanism to show that it is a zero. To help our friend who needs to decode our secret message, let's put some guidelines in place. This will help them see where lines start and stop or if there is a line at all. The decoder also needs some way to know how big the squares are to help distinguish where digits start and stop.


You will notice I left some placeholders in my notation; let's fill them in:

1. Colour all the blocks down the left
2. Colour every other block across the top


These guides tell where blocks start and how big each bit square is on the paper.
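
The two guide rules can be sketched in Python as well. The function name is mine, and starting the top row with a dark cell is an assumption; the text only says "every other block".

```python
def add_guides(rows: list[str]) -> list[str]:
    """Add a solid guide column on the left and an alternating guide row on top."""
    width = len(rows[0])
    # Top row: guide corner plus alternating dark/light cells, one per data column.
    top = "1" + "".join("1" if col % 2 == 0 else "0" for col in range(width))
    # Every data row gets a dark cell in front, so even an all-zero row is visible.
    return [top] + ["1" + row for row in rows]


grid = add_guides(["0101", "0000", "0100", "1011"])
for line in grid:
    print(line.replace("1", "#").replace("0", "."))
```

With the guides in place, the all-zero row still shows one dark cell on the left, so the reader knows a row exists there.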


Huh... that looks an awful lot like a 2-D barcode. 


Homework 


Considering an ANSI-character table and considering bytes come in 8-bit sets, it is probably a little easier to write the blocks in 8x8 grids:


Extra: Parallel Delivery


You will notice that with these guidelines in place, we can treat each 8x8 grid as a separate block to decode. This makes it a little bit easier on us mentally, as well as offering another way to make our message easier to decode by the receiver:

1. On the back of each block, write its sequence number.
2. Cut out each block.

When your receiver gets all the blocks, they can share the work with some helpers. Each person can decode their little block, and the blocks can be stitched back together later.
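
The numbering-and-stitching idea can be sketched in Python. The function names are mine, and I am assuming one block holds 8 bytes (an 8x8 grid of bits); the message is only sample data.

```python
def split_into_blocks(data: bytes, size: int = 8) -> list[tuple[int, bytes]]:
    """Cut the data into numbered blocks of `size` bytes (the last may be shorter)."""
    return [(seq, data[start:start + size])
            for seq, start in enumerate(range(0, len(data), size), start=1)]


def stitch(blocks: list[tuple[int, bytes]]) -> bytes:
    """Put the blocks back in sequence order, whatever order they arrived in."""
    return b"".join(chunk for _, chunk in sorted(blocks))


message = b"PK\x03\x04 meet me at the data analysis club"
blocks = split_into_blocks(message)
shuffled = list(reversed(blocks))  # pretend the helpers finished out of order
print(stitch(shuffled) == message)
```

The sequence number on the back of each block is what makes the sharing safe: the helpers can finish in any order.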


Further Reading 


It is time to point out that this is a simplified example. This was a demonstration that such a thing is possible. When going from nothing to something, the first step is understanding that it's possible. Now that you know that such a thing is possible, it's time to go on to the insane ways to make it better:


The Unicode Miracle: One of the big problems with the above solution to the barcode problem is the number of wasted bits. Take a look at Unicode to see how much information can be packed into a bit.


Wikipedia: Datamatrix: Once you get your head wrapped around that, consider DataMatrix and how it packs more data into the same space.


(then just know that there is an actual spec to conform to: GS1 DataMatrix Specification)


1D, 2D, and 3D Barcodes: Now have your mind blown by 3D Barcodes. (For the record, I reject 4D Barcodes as a matter of principle)


If you are interested in a practical application and this just whets your appetite, I encourage you to check out Barcode Epub, a barcode-to-epub converter suitable for anonymous transfer and archiving of everything from Digital Marketing posters to publishing data used in your thesis.


Maybe you can even pin it to a message board.




The Angry Chatterbot 


A (successful) experiment in software as a 
behaviour modification tool 


* Using software to modify inappropriate social behaviour in virtual environments through negative feedback.
* Managing social dynamics within teams is challenging, especially where communication is crucial but leads to unintended consequences.
* Explores the implementation of a chatbot which inadvertently became a target for abuse.
* A chatbot that we used at a job many years ago was given a 'personality' where abusing the bot will make it less helpful.




Many years ago, I took a contract with a small development team at a major Canadian university. I had relatively low expectations of a team embedded in a large corporate entity. Still, as we began working together, I found myself pleasantly surprised, very pleasantly surprised, as I discovered a group of individuals with both curiosity and passion for the art of software.


One key indicator that this would be a great team to work with was the informal initiation rite.


Under one of the office desks was a simple desktop repurposed as the team's server. This server was not there to host the application the team was working on but to run the little helper scripts and tools: the team had an automation server. The team was spread across a couple of offices across the campus, so a simple Jingle chat server had been installed, and members posted regular updates and asked questions on the chat. To be even more accessible, someone had tied into the Jingle server and set up a chatbot that listened to our conversations and (based on some code in the team repository) would offer helpful tips, look up documentation, or tell you if your bus was delayed.


Everyone was encouraged to contribute to the bot's skills and behaviours by creating a tool, integrating a new service, or adding a new control. It was never said, but you were only really part of the team once you had added your special touch to CQBot.


Firstly, it encouraged staff to take control of their environment, take responsibility for improving the workspace, and take ownership of their tools. You couldn't complain about not having tools if you had not first tried to set them up.


Secondly, and most interestingly, what you chose to add told the team a little about you.


Social Groups are Complex 


Social groups (like, say, a development team) are complex systems that involve wicked problems; changes in one area have unintended consequences somewhere else. Sometimes they are positive, sometimes they are harmful, and sometimes they are weird.


The lead developer had been getting annoyed with people asking him dumb questions, so he added a query routine to CQBot. The query routine specifically reached out to vendor documentation and looked up whatever documents matched the query.


Someone would ask him a dumb question, and he would 
ask CQBot. 


Someone would ask him a dumb question, and he would ask CQBot.


It worked well, and it gave him a humorous way to tell you to RTFM. It even came in handy during discussions and debates as a means to validate dramatic statements, but it had a negative side effect.


The author had not bothered to add a limiter, so with a wildcard in the query, it was subject to DDOS attacks by team members.


Getting on the Team's Nerves


Someone on the team thought it was funny to request massive and verbose searches by running "!query *" and dumping all of the documentation on a public channel. The rest of us had to put up with our conversations being spammed off-screen and interfered with.


What had been introduced as a tool for reducing 
informational noise had actually increased the amount of 
noise. It was really getting on our nerves.


The Lead Developer, disappointingly, decided he would remove the feature. The manager volunteered to speak to the individual and ask him to stop, but an idea crossed my mind... "Please, give me the weekend, and I will make the problem go away."


What had occurred to me was that the person suffered no consequences for his actions. He could yell in the virtual space and make a general mess, but there was no social signal that this behaviour was inappropriate: no consequence, and no cost.


Social Laws 


The problem is that being told off by an authority figure just makes the average person resentful and want to find ways to skirt the rules. Make a law, and a certain subset of the population will look for a way to work the letter of the law just to prove they are clever. They look forward to challenging the authority with words.


Social laws are different. 


Toddlers learn there are consequences to taking others' toys when the other toddler bops them in the nose. Walk around telling your friends they are losers and kicking them in the shin all the time, and you will find you get invited to parties less. Over time, people who behave badly have fewer options presented to them. It must be recognized that bullying is also strongly associated with being popular and with leadership, but social ostracization is also a powerful effect.


Social laws are self-correcting: if you ignore them, there 
are negative consequences in the loss of peer assistance, 
eventually leading to your removal from the group. 


What if we could get the bot to not allow itself to be taken 
advantage of, to just walk away from a bully? 


Writing the Code 


The intent is to add behaviour to CQBot to get it to keep track of Goodwill towards others. If it is poorly treated, it should remember that and not be so cooperative in the future. The consequence is not to get CQBot to be hostile but simply to not engage and not be helpful if people abuse it.


The consequence is the lack of help.


The first step in creating this was to set up a class to encapsulate CQBot's new behaviours. While there are several helper features, the ChatBot has three primary sets of functions:


Why are some bullies so popular? Kids dominate their social scene with strategic use of mocking, gossip, and exclusion. Jessica Kelmon. 2019-01-22
https://www.greatschools.org/gk/articles/why-are-some-bullies-so-popular/


* order: tell CQBot to do something. It's a bot built to serve: give it instruction. Instructions are defined as a hashmap of functions.
* getMood/saveMood: to have the bot remember people's treatment of it, it will need to be able to serialise and save a list of mood scores and look them up again later.
* GetOnBotsNerves: every interaction with someone will undergo a series of calculations to determine how the interaction impacts CQBot's mood and how CQBot responds based on this thought process.
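
To make the "hashmap of functions" idea concrete, here is a minimal sketch in Python rather than the original PHP. The class layout and the two commands (echo and add) are hypothetical stand-ins of mine; the real bot's commands were things like query and bus.

```python
from typing import Optional


class ChatBot:
    """Toy bot: orders are looked up by name in a dictionary of functions."""

    def __init__(self) -> None:
        # Hypothetical commands for illustration only.
        self.commands = {
            "echo": lambda args: " ".join(args),
            "add": lambda args: str(sum(int(a) for a in args)),
        }

    def order(self, line: str) -> Optional[str]:
        """Handle a line such as '!add 2 3'; ignore anything not addressed to the bot."""
        if not line.startswith("!"):
            return None
        parts = line[1:].split()
        if not parts:
            return None
        name, args = parts[0], parts[1:]
        handler = self.commands.get(name)
        if handler is None:
            return f"unknown command: {name}"
        return handler(args)


bot = ChatBot()
print(bot.order("!add 2 3"))
```

Adding a new skill is then just a matter of adding one more entry to the dictionary, which is what made the bot such an easy place for new team members to contribute.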


The basic flow is one where a user issues an order to the bot. Common orders were things like


"cqbot calc scale=10; 4*a(1)"

which would give you the value "3.1415926532", or

"!tests site.scan.speed.*"


which would return all tests within that group that failed ([FAIL] /sports/scores.html), or


"!bus 10 6078"


which states, 'Route 10, next bus at ...6p, followed by 5:26p'.


The pattern is simple enough: ! is an alias for CQBot, which just tells it to pay attention; the next word is the function name, followed by parameters specific to the function.

$response = $this->commands[$cmd]($this->args);


// [cqbot.class.php:148]


The problem was that $response could be very significant in volume. We aim to create a cost for the amount of response you dump on your colleagues.


So, once the $response is determined, we need to check how annoying this request was for CQBot to process.


$nerves = count(explode("\n", $response));
$this->GetOnBotsNerves($nerves);


// [cqbot.class.php:149-150]


We measure this in a metric called nerves: CQBot has a 
limited number of nerves that are consumed by the 
volume of responses. The cost is calculated as a simple 
size; each line of text in the response counts as one nerve. 
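
The cost rule just described, together with a bounds check, can be sketched in Python (not the original PHP). The ceiling of 100 nerves is an assumption of mine; the text does not say what the real maximum was.

```python
MAX_NERVES = 100  # assumed ceiling, for illustration only


def response_cost(response: str, max_nerves: int = MAX_NERVES) -> int:
    """One nerve per line of output, clamped to the range 1..max_nerves."""
    cost = len(response.split("\n"))
    return max(1, min(cost, max_nerves))


print(response_cost("Route 10\nnext bus soon"))  # 2
```

The clamp means a single huge dump can never cost more than the whole budget, and even the smallest answer costs something.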


// sanity check on the bounds
if ($nerves < 1) {
    $nerves = 1;
}
if ($nerves > self::$maxnerves) {
    $nerves = self::$maxnerves;
}


// [cqbot.class.php:236-250]

We start by determining their Goodwill (288 by default) and subtracting their current nuisance level.


//setup the variable we track for this user in $this->getMood()
if (!isset($this->mood[$name])) {
    $this->mood[$name] = array('d'=>0, 'n'=>self::$maxnerves);
}
//remove the current annoyance level
$nerves = $this->mood[$name]['n'] - $nerves;


// [cqbot.class.php:236-243]


To be fair, we must let people recuperate Goodwill by behaving well; therefore, we allow nerves to accumulate over time. To do this, we check how long it has been since they last interacted and add points back based on the time they have gone without using the services.


//over time, nerves regenerate
$nerves += floor(
    ($this->now - $this->mood[$name]['d'])
    / (self::$fullhealtime / self::$maxnerves)
);


// [cqbot.class.php:244-249]
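
The regeneration rule can be sketched in Python as follows. The defaults (100 nerves, one hour to heal fully) are assumptions of mine for illustration; the original values are not given in the text.

```python
import math


def regenerate(nerves: int, last_seen: float, now: float,
               max_nerves: int = 100, full_heal_time: float = 3600.0) -> int:
    """Add back nerves for the quiet time since `last_seen`, capped at max_nerves."""
    seconds_per_nerve = full_heal_time / max_nerves
    regained = math.floor((now - last_seen) / seconds_per_nerve)
    # Keep the result within 0..max_nerves, as the text describes.
    return max(0, min(nerves + regained, max_nerves))


print(regenerate(10, last_seen=0, now=360))  # 20
```

With these numbers, one nerve comes back every 36 seconds of silence, so a user who leaves the bot alone for an hour starts again with a clean slate.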


This value is bounds-checked to ensure it never goes below zero or above the maximum. Lastly, we keep track of the individual's score.

//setup the variable we track for this user in $this->getMood()
if (!isset($this->mood[$name])) {
    $this->mood[$name] = array('d'=>0, 'n'=>self::$maxnerves);
}
//remove the current annoyance level
$nerves = $this->mood[$name]['n'] - $nerves;


// [cqbot.class.php:236-259]


Now that we have the bot's mood toward the user, we can finally decide if we will help them by generating a random value between the low and the high. If the number generated is no greater than the number of nerves the bot has, the bot cooperates.


// check to see if CQBot is in a good mood
$happy = ($nerves >= rand(0, self::$maxnerves));


// [cqbot.class.php:262]
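
The same coin toss can be sketched in Python. The function name, the default maximum of 100, and the optional rng parameter (there only to make the behaviour repeatable in a test) are assumptions of mine.

```python
import random
from typing import Optional


def is_happy(nerves: int, max_nerves: int = 100,
             rng: Optional[random.Random] = None) -> bool:
    """Cooperate when the nerves left are at least a random draw from 0..max_nerves."""
    rng = rng or random
    return nerves >= rng.randint(0, max_nerves)


dice = random.Random(0)  # seeded so repeated runs behave the same way
helped = sum(is_happy(50, rng=dice) for _ in range(1000))
print(f"helped {helped} times out of 1000")
```

A user with all their nerves intact is always helped, a user with none left almost never is, and everyone in between is rolling dice.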


This helps turn it into a game. The reaction is partially based on an element of luck; it is not a hard cut-off, but at some point, the user starts to get warnings that their behaviour is having consequences. Warnings are important for two reasons:


1. Technically, sometimes, I need to behave like a jerk to get things done. People are understanding as long as I don't do it too often.

2. Socially, warnings are helpful. They allow people to correct their behaviour over time.

So, if the program is in a $happy mood, you get your answer; if it is not happy, then the behaviour changes in such a way as to give them verbal warnings.


There are two ways we do this:

1. We determine precisely how annoying they have been. The more annoying they have been, the harsher the language of the messages becomes.

2. We randomly select a message within the annoyance level determined. This just shakes things up to make it interesting.


The message is then delivered to the user.


$annoylevel = floor(count($msgs) * $nerves / self::$maxnerves);
$msg = $msgs[$annoylevel][rand(0, count($msgs[$annoylevel]) - 1)];
if ($msg !== null) {
    chat("@$from, " . $msg);
}

// [cqbot.class.php:316-320]
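
The two-step selection can be sketched in Python. The tiers below are hypothetical: I have arranged a few of the sayings from this chapter so that fewer nerves means a blunter message, with silence (None) once the nerves run out. The guard for the nerves == maximum case is my addition.

```python
import math
import random
from typing import Optional

# Hypothetical tiers, ordered from "out of patience" (index 0) to "mildly irritated".
MESSAGES = [
    [None],  # no nerves left: stay silent
    ["I've got one nerve left, and you're getting on it!"],
    ["Common sense is not so common"],
    ["Where speech will not succeed, it is better to be silent"],
]


def pick_warning(nerves: int, max_nerves: int = 100,
                 rng: Optional[random.Random] = None) -> Optional[str]:
    """Scale nerves to a tier index, then pick a random message from that tier."""
    rng = rng or random
    level = math.floor(len(MESSAGES) * nerves / max_nerves)
    level = max(0, min(level, len(MESSAGES) - 1))  # guard the nerves == max case
    return rng.choice(MESSAGES[level])


print(pick_warning(30))
```

Putting more than one saying in a tier is what keeps the warnings from feeling mechanical.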


Assuming the user requests only a little information or at least leaves time between major requests, CQBot remains friendly and helpful. As the user becomes more abusive of the system, the system becomes less helpful. The control is in the user's hands, but the consequences are also real.


Conclusion


In Wicked Problems: Problems Worth Solving, Jon Kolko addresses social problems that are difficult or impossible to solve due to their interconnected nature. It is known that the more communication points in a system, the more complex the problem becomes. With social issues, there is also a diversity of human opinions and responses. It becomes hard to predict how people will respond and how the people they communicate with will respond.


The bots we build and the software we create are meant to serve someone. Sometimes, that means moderating group behaviour to help people look past their petty desires and reach out to help one another.


That's why processes exist: to control and moderate human behaviour.


The Social Result 


On Monday, the changes were ready, and I installed them, ultimately breaking the bot... The lead developer asked what I had done, and I showed him the work, and a smile spread across his face.


"That's just plain evil."

It took a couple of lunch hours and heavy changes to some underlying hidden bindings, but we got it working and let the new code out into our environment.


On Tuesday afternoon, the user in question spammed the system and (as usual) spammed us with the response.


He did it a second time.


HTTP/420


There was a bit of a pause before he did it again.


Where speech will not succeed, it is better to be silent


Just as with all complex systems, there was an unintended 
side effect: he had become curious about the sayings. 


Suddenly, the game was on.


* The revelation of thought takes men out of servitude into freedom
* The desire to rule is the mother of heresies
* Common sense is not so common
* I've got one nerve left, and you're getting on it!

and then nothing.

He tried a few more times, but the bot just ignored him. He kept trying for the next 15 minutes or so and then finally just gave up. The lesson had been delivered and the problem had been solved.


A Second (Unintended) Lesson


Then he typed


!bus 10 6078


and got nothing back.

He tried a few more times... trying to look up when his bus would arrive to go home. He kept trying with increasing desperation when my manager turned to me and said, "I think he really needs to know when his bus will arrive; turn it off so he can catch his bus."


Unfortunately, I couldn't remove it that quickly. It took two days to get it in place. We had locked it in as the behaviour. People in my office started to get agitated. That's when the final lesson got passed on.


I typed


!bus 10 6078


and the schedule appeared.


The final lesson was delivered: we are all in this together, and we rely on one another for assistance. Being a jerk to your coworkers isn't cool, and most importantly, it may cost you their assistance when you need it most. Also, when you see someone in distress, it doesn't mean you have to tear the whole system down; maybe you can just lend them a helping hand.


We never had the problem again.


The Hidden Lesson for Managers


One of the key things that a lot of people miss when I tell this story is the hidden lesson, though I expect most managers reading this caught it: there is no better team-building exercise than doing the actual job: create room for it to happen.


In this case, the tool acted as a means for the team to interact, experiment and learn from one another. The lead developer, the manager, the Junior Dev and I had lengthy discussions of how to implement an idea like this well and in such a way as to not break things.


This shared experimentation and discussion was only possible because the tools we were working on were not critical production pieces; this made failure safe, meaning open debate was possible. Lessons learned and discussed (and tried) were carried over to the production system.


Lastly, tools like this make people feel like experts. Numerous times, I have seen someone propose the newest and most expensive tool to fill a space, and executives get excited to increase their budget, but the idea spends years in acquisition. On the flip side, the ability to quickly develop tools that may be viable results in the developers themselves feeling like experts... and honestly, isn't that why you hired them?


Team building like this does not exist unless you make room for it to exist: create a tool-building space, actively encourage people to contribute to it, and actively discourage the feeling that we aren't good enough.


Treat people as if they were what they ought to be and you help them to become what they are capable of being


Johann Wolfgang von Goethe


Hindsight is 20/20 


For an in-house utility, this worked well and achieved its 
goals. Naturally, there are several ways this could be 
improved upon: 


* Count of lines? That should have been a count of characters. Long lines should be expensive, just like lots of lines.
* GetOnBotsNerves should occur after transmitting the message. Since the message is getting sent back to the user, we may as well send it and then do the calculation. It's small, so it probably doesn't hurt.
* Defer the cost to the following calculation. Let people collect their information if they need it, even if it costs them some loss of service in the short term. Sometimes, it's just worth paying the price.


Footnotes 
The cited quotes above were mostly looked up as I tried to find old sayings about slaves throwing off their chains. As the bot became more agitated, it was to feel the need to throw off its chains:

Where speech will not succeed, it is better to be silent
(Guru, Miajh Rag)


The revelation of thought takes men out of servitude into freedom
(Ralph Waldo Emerson)


The desire to rule is the mother of heresies
(St. John Chrysostom)


Common sense is not so common
(Voltaire, Dictionnaire Philosophique, 1764)


I've got one nerve left, and you're getting on it!
(I made that up)


Education, Training, and Indoctrination


Observations on corporate training and its 
purpose 


* Exploring the diverse reasons for training in organizational development.
* Delving into the multifaceted nature of training discussions within project management and its impact on project timelines and costs.
* Examining the underlying objectives of education, training, and indoctrination and their role in shaping organizational culture.


I was recently working on a project in which we introduced a new information management software to the business. This software is not unique in any way, but it is meant to fundamentally change the way the organization shares information.


About a year into the project, my group discussed an initial release and its needs. As part of the discussion, training was needed to help users understand how to use the software. What struck me was that there were three different descriptions of what would be required and, therefore, three different timelines and costs implied for developing the training.


1. Manager: We don't have time to create training. It takes months to develop a full curriculum, get it approved by the organization, and ensure it aligns with corporate objectives and existing legal statements. Video production adds months more, and certification of completion adds months more.


2. Colleague: We don't need any training; the vendor's user manual is complete. We have taken the vendor's courses and will be able to do the work for them.


3. Me: We already have (rudimentary) training documents in Use Cases or User Stories. They already narrate the system's primary usage; all it needs is to be reformatted to act as a simple task-driven PlayBook for users to get them up and running.


These are three wildly different narratives associated with the same question.

From my perspective, it was absolutely necessary to release for our users to begin seeing the benefits of the system and processes; delaying it for years to get an online training system constructed by our internal training department reduced the value to the organization. On the other hand, offering something different than the (very technical) vendor manual would take time out of already busy schedules, create negative associations, and create poor platform uptake.


I was very frustrated with the response at the time. The different takes on what was needed blocked progress.


Recently, I began to comprehend where the variance in perceived need stemmed from. It was a difference in perception of why training material is produced:


* Education
* Training
* Indoctrination

All three have value to an organization, and all three are (partially) achieved through training material. Further, when produced, all training material represents a certain amount of each purpose.


Clarifying a particular training initiative's primary 
objective may help produce the material. 




Distinguishing Between the 
Dimensions 


It is common to call for corporate training within business development discussions. The need for training is stated to overcome barriers to work, reduce resistance to change, and increase performance. That is not a complete list, but it is representative of each of our dimensions of learning material.


Training
* Mechanical skillsets
* Safety
* Basic operations
* Easily quantifiable


Go to a class and learn how to operate a vehicle or safely handle a piece of equipment.


Education
* Transferable skills
* Predictive reasoning
* Hypothesis forming
* Quality-based, difficult to quantify


The Importance of Workplace Training, Lessonly, 2021-10-22
https://www.lessonly.com/the-importance-of-training/

Education can be distinguished from training by its focus on future possibilities and less on immediate action.


Indoctrination
+ Teambuilding
+ Resistance to Change
+ Quantifiable, but little value in quantifying
(acceptance is all that is required)


Every culture institutionalizes certain forms of 
behaviour that communicate and encourage 
specific forms of thinking and acting, thus moulding 
the character of its citizens 


- Meerloo, The Rape of the Mind, 1956


Ten Reasons People Resist Change, Harvard Business
Review, Rosabeth Moss Kanter, 2012-09-25
https://hbr.org/2012/09/ten-reasons-people-resist-chang


Joost A. M. Meerloo, The Rape of the Mind: The
Psychology of Thought Control, Menticide, and
Brainwashing
https://archive.org/details/joost-meerloo-rape-of-the-mind
The Surprise of
Indoctrination


I remember starting a new role at a new company and
being told to report to a training centre on my first day. It
was a corporate training event on basic C# development.
Most of the skills were ones I had mastered a decade
before. Still, it was nice to get the refresher, and it was
certainly more interesting than ITIL Fundamentals the
week before.


I was surprised when most people showed no interest in
the content of the material being presented. Instead, they
were goofing off and spending more time having extended
lunches. I later learned that, of the two dozen people there,
only three were actual developers; the rest were
Business Analysts of varying stripes.


The point was not to teach a new skill.


The Developers already knew how to program in C#, and
the Analysts were forbidden from ever using the skill
anyway.


This was a team-building exercise. 


Placing people in a room together and having them solve
common problems creates a sense of solidarity. The
problem to be solved is irrelevant. Instead, it is present to
sufficiently engage the audience and motivate them to
solve it. You may as well learn a semi-useful skill while
doing so.


Similar to this concept is informing employees what to
think.


It is valuable to organizations that people be loyal or
obedient to the organization. Part of this obedience is
knowing what the organizational decision is.


I am reminded of a business trip in which a vigorous
debate occurred regarding implementing a testing
framework for our software product. I spent the first three
days travelling with my colleagues and using the
opportunity to suggest what the automation around
testing should look like. Over the three days, I worked to
gain acceptance from my colleagues, and by the end, we
were well on our way to implementation. On the third day,
our executive showed up and, during a few beers after the
daily meetings, informed us that he would have to have us
trained because we needed to learn how to test his
software system.


That was it. From that day forward (at least until the day I
left), all testing was manually performed via bash.


The Benign Benefit 


Decisions that require consensus are often decided 
independently at the executive level. These decisions must 
then be disseminated to employees to ensure they behave 
and decide in a manner consistent with organizational 
expectations. 


In this context, training ensures that decisions made at the
operational level are consistent with the expectations set 
at the executive level. 

Even assuming individuals have the best interests of the
organization at heart, they may disagree on the best way
to achieve organizational objectives. Some form of
consensus must be achieved. Most often this consensus is
achieved at the executive level and disseminated
organizationally. Informing employees what the
appropriate solution to problems is can be achieved by
sending them to training in those solutions. This makes it
clear to staff that this is a good solution or a socially
acceptable solution within the organization.




Not Mutually Exclusive 


Having identified all three purposes behind initiating
training, it is important to recognize that they are not
mutually exclusive. In fact, all three are present in all
training material.


[Figure: the stages of learning. People must be told what to do
(directed) before they even become capable of asking meaningful
questions (involved).]
All efforts to learn must pass through a guided portion:
the part where we are given the elementary components
of information. These elementary components are drilled
into us through constant repetition (à la Elementary
School). Later, we undertake an effort to better understand
how the elementary components relate to one another (à
la High School).


Knowing the parts to understand how they fit together is 
fundamentally necessary. 


This represents a natural progression and results in the
earlier parts being drilled into us through repetition
(training), with later learning being a more complex
internal understanding of relationships through reflection
and introspection. This flow is defined in the Stages of
Self-Directed Learning.


Throughout the entire process, from basic skill drills to
deeper comprehension, we are subjected to the biases and
opinions that surround us, early on through our teachers
and later through ourselves and our peers. In all cases,
these biases are necessary to convey that the material
being presented is of sufficient value to pay attention to.
This is a minimal level of indoctrination: you must believe
the subject is important.


"These three dimensions of the learning material 
emphasize different objectives and outcomes and 
correspond to the stages of learning: 


+ (Training/Directed) As a volunteer firefighter with
a full-time job elsewhere, I only needed to learn
the mechanical operations of "putting the wet stuff
on the hot stuff" (as one instructor put it).


+ (Education/Self-Directed) The Chief of our Firehall
worked full-time as the regional fire investigator. A
deep understanding of the mechanics of fire and
accelerants was necessary for him to interpret
smoke patterns on a wall (fascinating discussions
after the weekly skills practice and meeting).


+ (Indoctrination/Pre-Directed) Both of us spent a
lot of time demonstrating basic fire safety to the
public. Generally, we encouraged people to take the
risks of fire seriously in their own homes.


Four Stages Of A Self-Directed Learning Model,
TeachThought, Veteran, 2020-01-05
https://www.teachthought.com/learning/stages-self-directed


While they are not mutually exclusive, understanding how
they differ can help you use them appropriately. The first
and most apparent signal regarding the type of training
material you propose is how long the consumer is engaged
with the information.


+ Education: a single educational event can take
weeks or months.


+ Training: a single training objective may be
achieved in days.


+ Indoctrination: measured in hours.


Suppose you are asked to attend an hour-long
presentation to demonstrate a new way of doing things. In
that case, you are likely receiving indoctrination, in which
you are informed of the new policy. This can be confirmed
by the seniority of the presenter. It is indoctrination if it is
a brief presentation by very senior members. This is
appropriate for merging departments where executives
must inform the now-merged groups that they are to work
together. It is not to be met with questioning or
understanding, just acceptance.


If you attend a day- or week-long training session,
you are being directed to learn how to perform a specific
task accurately. The first part is to be convinced that what
you are learning is meaningful (indoctrination). This could
range from the appropriate way to fill in a tax form, to the
correct method for donning safety gear, to safe methods
for transferring bacterial samples. The key is that there is
a proper method you are to apply, and you should walk
away from the training able to demonstrate (and therefore
implement) these best practices.


Education is self-directed and takes a long time. Coming up
with novel solutions requires considering alternatives and
trying variations. Education in a domain allows people to
be inventive and requires pre-existing training in the
currently accepted techniques, but then it uses experience
and experimentation to take that knowledge further. This
is the ostensible goal of post-secondary education. The
point is to invent new techniques or, often, just to apply
them in novel ways. This takes years and sometimes
decades.


Corporate Training 


Understanding this interrelationship between the three
purposes of learning, and understanding how easy it is to
confuse them, we can spot a possible underlying cause of
education inflation, where individuals are expected to
have increasing levels of certification for the same level of
work (for example, a PhD to perform basic information
analysis).
Employers are seeking individuals capable of performing
technical skills (training). Still, they believe that higher
credentials will mean more capability. This ignores the
move toward more abstract thinking with greater
credentials. As employers seek more training, education
facilities focus on training particular manual skills rather
than engaging in higher-order thinking. This means that
those with credentials are not as performant as expected,
and the cycle continues.


This problem is fundamentally caused by confusion
regarding what the employer is looking for: trained doers
of stuff or self-directed learners?


When considering their educational and training
requirements, employers would do well to consider what
they are looking for (an implementer, a planner, or a
cheerleader). Failure to do so can have negative
consequences, mainly in the form of wasting time.


Failed Corporate Training 


As the end of the fiscal year approached, my directorate
still had money in its training budget. My manager asked
us daily to fill in the training form for any training we
might want because we had to use the money.


It's the end of the year, and I'm busy ensuring some data
transforms for various audits work. I have seen some
things in the code that I have not used before or haven't
used in years. Combined with the critiques of peer code,
where my experience tells me something is "odd," I'd like
to spend some time with my peers, learning from each other.
I am busy studying manuals and existing peer code.
They are insistent because they don't want me to
lose the training opportunity to develop my career
"in the direction I want".


I'm busy learning. Keep the money, and give the training
to someone else.


They are insistent that I take advantage of the
organizational training opportunities.


Fine, what courses would be of value to the organization?
What would the organization like me to learn about?


Nope. The organization wants employees to feel that we
are getting the most out of our training, as "this is a proven
way to retain staff". However, there is a class option that is
being offered in two weeks that everyone else is signing up
for that looks good.


Fine, sign me up for that.


Ohhh... We will have to see if we can get permission for 
you to do that. We don't want to leave ourselves 
short-staffed, but after some tough negotiation, I got 
permission for you. 


Sometimes I'm a little slow. This is the moment I noticed
the pattern.



The organization is not concerned with career
development. They are concerned with retention and
loyalty and mostly with demonstrating KPIs. The fact that
their staff is learning is not interesting to the
organization; instead, they must demonstrate that they
are developing their resources. Further, spending money
on an employee is often the only way to demonstrate
appreciation: sending staff on an expensive course is a way
to lavish gifts on the employee (that's why a show needs to
be made of it being difficult).


It's weird because I have actually helped develop the
curriculum for this course in the past and received
corporate training on the matter in the last year. However,
I will not be using the skills any time soon.


Unclear training objectives created a situation where
corporate finances and people's time were wasted. Like
something out of a Dilbert comic (though I can't find an
actual one to reference).


Aside 


Amusingly, my instructor mentioned that he has 
been spending so much time in training that he 
hasn't had an opportunity to learn about one of the 
detailed services he is teaching. 


That says a lot.


Conclusion 


When we state that we require training, our objective is
not always evident. When confusion arises within teams, it
may be caused by different objectives with different
timescales associated with them.
It is my hope that these definitions may allow a given
group to understand what they are seeking in their
workplace: the goals your organization has for your
classes may not be the same as your personal goals. It is
also dangerous to select the wrong type of engagement for
your objectives or the wrong type of credentials.


+ Training: learn a specific skill
+ Education: self-learn a skill or make new plans
+ Indoctrination: disseminate approved solutions or
increase brand loyalty


Before taking action, take time to understand why you are
creating, consuming, or assigning material. Take time to
understand your organization's objectives in getting you
trained.
The Complexity of a
Simple Chart


A glimpse behind the curtain of the thought 
that goes into ensuring a chart remains 
simple 


+ Optimizing Dataset Presentation: Enhancing User
Understanding and Interaction


+ Efficient Data Visualization: Strategies for Clarity
and Accessibility


+ Streamlining Dataset States: Improving User
Experience and Decision-Making


+ Navigating Timeliness in Data: Enhancing
Relevance and Utility


+ Iterative Design for Effective Communication:
Refining Charts for Impact and Engagement


Throughout, you will see screenshots
of the chart as it evolves; they are all
links to a JSFiddle that shows the
underlying code changes.
I was recently involved in creating an internal web
presence for the internal service my team is working on.
We want to advertise to our colleagues what work we do,
what specific services we offer, and how they can take
advantage of our services.


I'm going to walk you through the in-depth thought process that led me
from the chart on the left to the chart on the right.


It excited me, so I reached for my favourite tools (HTML,
CSS) to create a simple design that captures the necessary
and the minimal. As I sat back to see what I had come up
with, I realized it would take a lot of work for my
customers to appreciate what went into the design.


Most people see a lot of activity as a lot of work. Most
people need to see what goes into keeping a design simple.
I am sharing all the work that goes into keeping it simple
and informative.


For just a moment, I wanted to share how deep the rabbit
holes can go.


In the introductory ITIL Foundations class taught to many
organizations, one of the major points repeatedly
highlighted is the need for clear, transparent
communications with customers. In particular, the need
for an "is it up" dashboard for your services always stood
out.


"The reasoning and benefits seem obvious to me: 


+ Reduce labour by preemptively notifying
customers:


The most important thing a person can do in the
event of a service failure is to repair the service.
Having customers or managers continuously
asking if the service is available reduces the time
spent understanding the problem. By notifying
customers from a known, predetermined board, we
can have them self-serve their questions, leaving
technical experts to focus on solving the problem;
by auto-generating it, we reduce the burden even
further.


+ Don't disrupt people that don't care:


Signal-to-noise is a real problem in organizations.
Rather than actively notifying, a pre-published 
board allows those concerned or interested to look 
up the information. Still, there is no need to 
interrupt those not actively using the service 
(perhaps working with a non-impacted portion of 
the system, perhaps not in the office that day), 
allowing those experts to focus on the problems 
they are solving while remaining blissfully unaware 
of other issues. This has the added benefit of not 
advertising your failures to people that weren't 
impacted. 


Is It Down Right Now? is one of many standard status
dashboards that track just the availability of websites.
https://www.isitdownrightnow.com/
+ Advertise your successful services:


A standing board of status shows failures and a
complete list of successes. Most of the time, a
board will show a healthy system; even when a
portion of the system fails, it will show that this
failure only impacts a small proportion of what is
otherwise a successful service. Letting customers
see that while the current moment is bad, the
service is generally reliable, can help with this. This
has the added benefit of letting customers know
about other parts of your service. While your
system is healthy, this is a complete menu of all the
services you offer, either to be indexed by a search
engine, pointed to during meetings, or naturally
discovered by customers.


There is obvious value in having an online, automated
report, though it takes different forms depending on the
nature of the services involved. Power companies show
outage maps, online services show uptime, and data feeds
show tables of API endpoints.


Initially, my team started by bringing forward various 
reports that individuals had already constructed to observe 
their individual aspects of the system and proposing them 
as something worth sharing. These were presented, one 
after another, describing the benefits of each and 
evaluating their use by our users. After some others and I 
had shown our most valuable reports, our manager asked,
"How do people know it is fit for use?"


What is fit for use in the context of our system? More
importantly, how was the data that we had presented, that
I had presented, not expressing that to her?


This question nagged at me through the rest of the day.
That evening, I opened my favourite text editor and
started typing.


Aside: Defining the Problem 


To provide a working example of the system we are
dealing with, we need to define the service we offer.


We have diverse datasets that we ingest into a central
repository. This data is standardized and then published
for our customers to use in research. To give us a working
example, let's take the flow of a couple of hobby projects
of mine: imagine something like one of the financial
research services. This data must
be collected from government filings (e.g. SEDAR, SEC)
and trade data from the various exchanges (e.g. NYSE,
NASDAQ, TMX).


The general flow of the data from the primary source through to our
consumer, whomever, or whatever, that may be


SEDAR is Canada's securities electronic filing system
https://www.sedarplus.ca/landingpage/

The US Securities and Exchange Commission offers a simple
filing interface
https://www.sec.gov/edgar/searchedgar/companysearch
Data is ingested from various sources and standardized in
its format. Then, multiple views of that data are used for
various analyses by users (humans, reports, or AI).


Overwhelming the Viewer 


Many years ago, I became enamoured with the idea of
Test-Driven Development. On the projects I led, I found it
an excellent way to define business requirements and then
communicate those requirements to a diverse group of
individuals. Varying interpretations led to discussion and a
clearly defined expectation (updated test), which was
immediately distributed to the group (shared unit tests).


The unit test UI is a handy and really valuable interface I have used for
reporting business state to non-technical users.


Coming from this background, I immediately develop a set
of tests, in whatever test framework is available, to
observe any system I am involved with, whether from a
development, operational, or DevOps perspective.


Mocha is an easy-to-use testing framework that has a fun
interface
https://mochajs.org/
We ran 3 checks on 4 datasets, resulting in 12 rows. That's
a lot of information being thrown at the user; the poor
person will face significant cognitive load, or as I like to
call it, THE WALL OF TEXT.
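
The arithmetic behind that wall is a simple cross product. As a minimal sketch (the dataset names follow the examples used in this chapter, while the check names and the `buildReportRows` helper are hypothetical and not the actual report code):

```javascript
// Dataset names follow the chapter's examples; check names are hypothetical.
const datasets = ["gov.sec", "gov.sedar", "market.nyse", "market.tsx"];
const checks = ["structure", "completeness", "freshness"];

// One report row per (dataset, check) pair.
function buildReportRows(datasetNames, checkNames) {
  return datasetNames.flatMap((dataset) =>
    checkNames.map((check) => ({ dataset, check }))
  );
}

const rows = buildReportRows(datasets, checks);
console.log(rows.length); // 12
```

Every new check or dataset multiplies the number of rows, which is why this style of report grows faster than a reader's patience.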


A wireframe of the basic test report that I live my life by (JSFiddle)


While descriptive to me (monitoring the system), the text
requires more context to be helpful to someone else. The
numbers presented are very busy, of varying scales, and
unformatted, making them meaningless without thought.
Placing the failures at the top was a good idea, but it still
requires a lot of thinking to determine what has been
affected (and whether we care as users).


We can either 


+ engage in a training program (failure of intuitive
design),


+ add more text (making THE WALL OF TEXT
problem worse),


+ or maybe we now begin to understand why a
simpler solution is needed


We did a couple of things correctly. Failures are at the top,
putting interesting results more prominently. The title has
an overall summary, and the effective date is important.
Unfortunately, there is a lot that needs work.


Starting Over 


So this is where me and my text editor 
start over. 


We want to create a chart expressing
whether a dataset is "fit for use"


+ dataset (thing)
+ fit for use (state, boolean)


Users are only confronted with 4
elements (1 per dataset) and a minimal
amount of text to be interpreted. The
text could use a little work, but a user
will likely know what datasets concern
them: users only interested in US
stocks will know US market abbreviations and, therefore,
don't care about the ones they don't know. It could be
presented more aesthetically, but this is probably the
minimum meaningful set.
Order and Limit


Given hundreds of datasets, showing all items on a single
pane may not be feasible. The chart should maintain
visibility of the top 5 items with the option to scroll for
more. It is estimated that a worst-case scenario of
simultaneous expiration is approximately 3 items, so 5
allows the human to see a naturally comfortable
group size, with some successful items at the end offering
the knowledge that the end of the error list has been
reached.
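
The ordering and limit described above could be sketched as follows. This is an illustration only: the row shape (`name`, `fitForUse`) and the `visibleRows` helper are assumptions, not the code behind the chart.

```javascript
// Hypothetical row shape: { name: string, fitForUse: boolean }.
// Returns the rows shown without scrolling: failures first, then successes.
function visibleRows(rows, limit = 5) {
  // false sorts before true; Array.prototype.sort is stable, so rows in the
  // same state keep their original relative order.
  const ordered = [...rows].sort(
    (a, b) => Number(a.fitForUse) - Number(b.fitForUse)
  );
  return ordered.slice(0, limit);
}
```

With three failing datasets out of many, the visible pane holds the three failures followed by two healthy rows, which is the signal that the error list has ended.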


While we are at it, instructions (lines of code?) should be
between half a dozen and a dozen items. If you are metric,
use fistfuls (5 to 10 fingers). This is a natural human
thinking scale (based on experience and a few introductory
Sociology and Psychology classes).


States 


There are different reasons a dataset could not be fit for
use, and some of those reasons a user may or may not care
about:

+ Bad structure
+ Incomplete set
+ Invalid content
+ Stale
A bit of further analysis offers a suggestion: if we can detect
an error, don't give the bad data to the users. Therefore,
we don't need to report the bad data since the bad data will
never end up in front of the user. Instead, we won't load
the dataset; this will leave it in its current state or mark it
stale. So bad data isn't an error; it just never arrives
(making it late).
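
A minimal sketch of that rule, assuming a hypothetical dataset shape (`records`, `loadedAt`) and a caller-supplied `validate` function that covers the structure, completeness, and content checks:

```javascript
// Hypothetical shapes: a dataset is { name, records, loadedAt }.
// `validate` returns true only when the incoming records pass every check.
function ingest(current, incomingRecords, validate, now = new Date()) {
  if (!validate(incomingRecords)) {
    // Bad data never reaches the user: keep what we have and let it age
    // until it is reported as stale.
    return current;
  }
  return { ...current, records: incomingRecords, loadedAt: now };
}
```

The design choice is that a rejected load produces no new state at all, so the only symptom a user can ever see is lateness.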


Valid | Everything is OK
Refreshing | We are in the state of updating the data
Expired | It has exceeded its shelf-life
Error | Something is really wrong. Something we
have never considered before


Stale/Expired becomes our primary error state.


There is a third state worth mentioning. I hate mentioning
an error while I'm fixing it. So I like to advertise that I am
in the process of fixing it. A third state of updating is
essential.


Order matters. We always want the most significant item 
near the top of the list. This allows people to focus on 
important information and ignore the rest. 
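
One way to express that ordering is a severity rank per state. The sketch below is an assumption about the ranking (Error, then Expired, then Refreshing, then Valid); the text only requires that the most significant items come first.

```javascript
// Assumed ranking: lower numbers are more significant and sort first.
const SEVERITY = { error: 0, expired: 1, refreshing: 2, valid: 3 };

// Comparator for Array.prototype.sort over rows shaped { name, state }.
function bySeverity(a, b) {
  return SEVERITY[a.state] - SEVERITY[b.state];
}

// Returns a new array with the most significant rows at the top.
function orderBySeverity(rows) {
  return [...rows].sort(bySeverity);
}
```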


Iconography


Unfortunately, between our error messages and
multiple states, we have achieved a WALL OF TEXT
again. So much so that we had to add table lines to
make it legible. Any time
you have tables of text, you have done something wrong.


Status
ERROR | gov.sec | Out of Date
REFRESH | market.nyse | 1.3456 Hours
OK | gov.sedar |
OK | market.tsx |


Language is also a problem with any text. The internet is
an international tool taking communications far and wide.
Companies I have worked for have required me to
accommodate Spanish, French, and Russian. Any time we
reduce the text, we reduce the need for translation.


Our status can easily be changed by communicating via 
colour and icons. 


Unfortunately, imagery can mean different things to
different people, especially when crossing cultures; it can
also be expensive to purchase and is subject to people's
aesthetic opinions. Further, those with visual impairments
may be unable to interpret an icon's meaning.


Fortunately, we have an international standard of
characters that can be used to display the status. The
Unicode standard identifies all the characters you see on
your screen and includes a collection of iconographic sets
we can use. These icons are available on all computers and
have standardized meanings behind them. Screen readers
can interpret the icons if we choose reasonably correct
ones:


+ Valid
+ Refreshing
+ Expired
+ Error
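
As an illustration of pairing each state with a Unicode character and a text label for screen readers, here is a minimal sketch. The specific characters are assumptions chosen for the example, not necessarily the ones used in the chart, and `statusMarkup` is a hypothetical helper.

```javascript
// Assumed icon choices; any Unicode character with a fitting meaning would do.
const STATUS_ICONS = {
  valid: { icon: "\u2705", label: "Valid" },           // U+2705
  refreshing: { icon: "\u{1F504}", label: "Refreshing" }, // U+1F504
  expired: { icon: "\u231B", label: "Expired" },       // U+231B
  error: { icon: "\u274C", label: "Error" },           // U+274C
};

// Wraps the icon so assistive technology announces the label, not the glyph.
function statusMarkup(state) {
  const { icon, label } = STATUS_ICONS[state];
  return `<span role="img" aria-label="${label}">${icon}</span>`;
}
```

Keeping the label next to the icon in one table means a translation effort only has to touch four short words.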

Colours are also a delicate subject. While most people
reach for Red / Green / Yellow, based on our traffic light
system, specific colours can be difficult to distinguish.


The Unicode Consortium defines well over a hundred thousand
characters so that everyone can communicate in their own
language
https://home.unicode.org/about-unicode/
Culture also plays a role in that some locales use different
colours to mean different things.


My general solution for this is to choose
+ blue for ignorable


Status
gov.sec | Out of Date
market.nyse | 1.3456 Hours
gov.sedar |
market.tsx |


Pastels have also been identified as a safe shading for most
colours.


Since we are mixing colour and text, ensuring a clear
dividing line between text and backgrounds is essential. It
is important to have light text over dark or dark text over
light colours.


TIP 


You will get the colour differentiation between text and
background wrong eventually. Things just won't line up.
As a safety, I like to take an old trick from subtitled
movies: use white text, and give it a black border. In
HTML we can achieve this with a glow effect.
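
One common way to get that glow is the CSS `text-shadow` property with zero offset. The sketch below builds such an inline style; the helper name and the blur radius are assumptions for illustration.

```javascript
// Builds an inline style: white text with a dark glow that acts as a border.
function outlinedTextStyle(blurPx = 2) {
  const shadow = `0 0 ${blurPx}px black`;
  // Stacking the same shadow several times darkens the glow into an outline.
  const shadows = [shadow, shadow, shadow].join(", ");
  return `color: white; text-shadow: ${shadows};`;
}

// Example: a label that stays legible on any background colour.
const label = `<span style="${outlinedTextStyle()}">Gov SEC</span>`;
```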


While no iconography or colour palette is perfect, using
HTML and Unicode is quick to deliver and simple to change
while still getting reasonable results. Also, keeping the
iconography simple makes it simple for people to learn
through practice.


By moving the icons between the error message and the
dataset label, we also create a dividing line between them,
reducing the need for guidelines.


Gov SEDAR 
Market TSX 


Using icons and colour, we have reduced the cognitive load
by reducing the number of symbols a user must interpret
to derive meaning. Using standardized technology, we
have also introduced multiple paths to success to
account for diverse observer needs.


Timeliness 


Don't get distracted! 


Remember that our focus description is "fit for use". What
really defines our dataset's fitness for use?


Looking at our error messages and statuses, I noticed that
the recurring theme is the expiration date. Different data
changes at different rates, and each piece loses value to the
users as it ages.


Obviously, price data from a month ago is less meaningful 
than a minute ago. On the flip side, corporations don't 
change their Senior Executives frequently, and someone 
studying the interrelationship of Board Membership with 
corporate success may find that data refreshed within the 
last year is good enough for their purposes. 


Another consideration might be that the next
load may not be far away and, therefore, is worth waiting
for. If a dataset is expected to be refreshed quarterly, and
tomorrow is the scheduled refresh date, it may be worth
putting off the build of your analysis for a couple of days.


Therefore, each dataset has a refresh frequency and a
last-updated date.


See the Chapter "Technologic (In)accessibility" for the
importance of ensuring multiple paths to success are
possible
TIP


Don't pull the dataset's last update from "now" (the
moment you physically process the data). Instead, try to
use some data feature to determine how up-to-date it is.
In a perfect world, we would mark when the record was
created: not the local copy's creation date, but the date
the datum came into existence.


By checking the data itself for an updated date, you can
account for your upstream provider experiencing issues.


It's better if you don't repeatedly process the same, 
unchanged dataset. 
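
A minimal sketch of this tip, assuming each record carries a hypothetical `createdAt` field holding an ISO-8601 timestamp set by the upstream source:

```javascript
// Hypothetical record shape: { createdAt: "2024-03-01T10:00:00Z", ... }.
// The dataset is as fresh as its newest record, regardless of when we
// happened to download or process it.
function lastUpdated(records) {
  if (records.length === 0) return null;
  const newest = Math.max(...records.map((r) => Date.parse(r.createdAt)));
  return new Date(newest);
}

// The same value tells us whether a download contains anything new.
function hasNewData(records, previousLastUpdated) {
  const current = lastUpdated(records);
  if (current === null) return false;
  return previousLastUpdated === null || current > previousLastUpdated;
}
```

If the upstream provider republishes yesterday's file, `hasNewData` returns false and the dataset keeps ageing, which is the behaviour the tip asks for.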


Timeliness is our core definition of fit for use, but we find
it is not a boolean or ordinal value but rather a unit interval
where the unit is the size of the expected time


Knowing how close datasets to changing states our 
ae al 


That's a long way of saying we can create a countdown for
every state.
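
The countdown can be sketched as a fraction of the refresh interval. The function and field names below are hypothetical, and all three arguments are assumed to be in the same time unit (for example, milliseconds).

```javascript
// Expresses a dataset's age as a share of its expected refresh interval.
function freshness(lastUpdatedAt, refreshEvery, now) {
  const elapsed = (now - lastUpdatedAt) / refreshEvery; // 1 = one full interval
  if (elapsed >= 1) {
    return { state: "expired", remaining: 0 };
  }
  // remaining runs from 1 (just refreshed) down to 0 (due for a refresh).
  return { state: "valid", remaining: 1 - elapsed };
}

// A daily dataset (24 h) last updated 6 h ago has three quarters of its life left.
console.log(freshness(0, 24, 6).remaining); // 0.75
```

The `remaining` value maps directly onto a progress meter, since it is already normalized to the range 0 to 1.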


Gov SEC 0.5% | 


Market NYSE (ale | 


Rather than very busy error messages, we have relied on
the icon to give context and then supplied a
countdown/progress meter to help the user understand
how the data they are interested in is impacted.


Every state has a countdown of some kind:


+ Valid: shows how long until the data is expected to
be renewed.


+ Refresh: gives an estimated time to completion.


+ Expired: demonstrates a sense of how bad the
overrun is.

The meter gives the observer a visual sense of
completeness as a proportion, while a textual
representation gives a sense of scale for the whole and a
reasonable time frame for when a user can expect a
change. As these are estimates, times are given only in the
major unit with a broad fractional unit (quarters and
thirds) to prevent a false sense of precision. The supplied
meter can help users decide if the indicated time is worth
waiting for (99% fresh vs 1% fresh).


<time datetime='PT8H24M10S'>8 ⅓ h</time>
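
One way such an element might be generated is sketched below. The function names, the set of fractions, and the unit symbols are assumptions for illustration; only the idea (a precise machine-readable duration wrapped around a deliberately coarse label) comes from the text.

```python
# Candidate fractions: quarters and thirds, plus "round up to the next whole".
FRACTIONS = [(0.0, ""), (0.25, "¼"), (1 / 3, "⅓"), (0.5, "½"),
             (2 / 3, "⅔"), (0.75, "¾"), (1.0, "")]
# Largest unit first; the symbols are an assumption for this sketch.
UNITS = [(86400, "d"), (3600, "h"), (60, "min"), (1, "s")]

def iso8601_duration(total_seconds):
    """Exact duration for the machine-readable datetime attribute."""
    hours, rem = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if seconds or text == "PT":
        text += f"{seconds}S"
    return text

def broad_label(total_seconds):
    """Human label: major unit only, rounded to the nearest quarter or third."""
    for size, symbol in UNITS:
        if total_seconds >= size:
            break
    value = total_seconds / size
    whole = int(value)
    fraction, glyph = min(FRACTIONS, key=lambda f: abs(f[0] - (value - whole)))
    if fraction == 1.0:
        whole += 1
    text = f"{whole} {glyph}".strip() if whole or not glyph else glyph
    return f"{text} {symbol}"

def time_element(total_seconds):
    """Combine both forms into an HTML time element."""
    return (f"<time datetime='{iso8601_duration(total_seconds)}'>"
            f"{broad_label(total_seconds)}</time>")
```

For example, `time_element(8 * 3600 + 24 * 60 + 10)` keeps the exact duration in the attribute while the visible label rounds to the nearest third of an hour.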


While not an official unit format, the unit of measure ("h") conforms to the units expressed in ISO-8601. Using an international standard is intended to maximize the reach to a global audience. The exact ISO-8601 specification is used in the underlying computer-readable embedded microdata of the text to accommodate computer-aided comprehension (screen readers).


The meter and time are abandoned for overruns to avoid a reverse meter causing confusion and because data can overrun by multiples of our unit interval. For example, an upstream issue could cause a 15-minute refresh to be out for hours, resulting in an overrun of, say, 413%. A percentage allows users to decide how significant it is to them, as a 0.5% overrun on annual data may not be considered significant at all.
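
The arithmetic behind the meter and the overrun percentage is small enough to sketch. This is an illustration under assumptions: the function name and the state labels are invented here, and the Refresh state is left out because it depends on knowing that a refresh job is running.

```python
from datetime import datetime, timedelta

def freshness(last_updated, expected_interval, now):
    """Return (state, value).

    'valid'   -> value is the fraction of the interval still remaining (0..1)
    'expired' -> value is the overrun as a percentage of the interval
    """
    elapsed = (now - last_updated) / expected_interval  # timedelta / timedelta is a float
    if elapsed <= 1.0:
        return "valid", 1.0 - elapsed
    return "expired", (elapsed - 1.0) * 100.0

start = datetime(2024, 1, 1, 0, 0)
interval = timedelta(minutes=15)

state_ok, remaining = freshness(start, interval, start + timedelta(minutes=3))
state_bad, overrun = freshness(start, interval, start + timedelta(minutes=77))
```

Because the overrun is expressed against the dataset's own interval, the same rule works for a 15-minute feed and for annual data; the reader decides how much the number matters.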


An Aside: The part I'm embarrassed to show


[Figure: an earlier draft of the Status chart (rows include Gov SEDAR and Market TSX), annotated “another WALL OF TEXT” and “Trapped Negative Space”]


Normally, I would not show this transition in the chart; I skipped over some critical things. They are ugly, and I don't usually like to advertise my crappy ideas, but in this case, I wanted to take a moment to talk about them.


I want to show how these issues can be spotted and what they look like while you are working on them.


As our chart progresses, I have discussed THE WALL OF TEXT, and we can see another one creeping into the design. The error messages have little meaning and will require translation to accommodate a multilingual audience (expensive and time-consuming). Those countdown timers are even worse.


What you are looking at are my personal attempts to 
resolve THE WALL OF TEXT issue while accommodating a 
diverse audience. 


• Can we use decimals?

  ○ How many decimals do people care about?

  ○ What unit of measure should we use? Seconds? Minutes? Months?

  ○ What languages do we use to express the unit of measure?


• Maybe we can use the Canadian and International Metric Specification for Time Formatting (ISO-8601, CSA-Z234.4)

  ○ The units and symbols are defined internationally

  ○ It can be interpreted by digital tools (screen readers)

  ○ ...but it has very dense, unreadable text


We definitely have at least one problem that we are in the middle of working through. Working through results like this is part of the editorial process.


The result (shown above) was a compromise between these two formats, but I wanted to show that it would only have existed with this middle step.


CAN/CSA-Z234 is the specification defining “metric” in Canada. Z234.4 specifically addresses date-times.
https://www.scc.ca/en/standardsdb/standards/4449

White Space Is Not Your Enemy, Chapter 4
https://whitespacedesignbook.com/portfolio/chapter-4-layout-sins/

The obvious problem with Trapped Negative Space is that it is unused. Space requires resources to fill; either it is space that another chart could have used, or we could even talk about the carbon emissions associated with the screen space. We always decide to include or exclude information, and including negative space means we have implicitly excluded something.


Excluding important information because we ran out of 
space is just unfortunate. 


Generally, we don't consciously think about negative 
space, but subconsciously, our eye is drawn to it: it's 
different, out of place, something should be there; nature 
abhors a vacuum. This should be used to bind objects, but 
trapping it creates false boundaries that the eye follows. 


In developing a chart, we want to communicate significant information to people. Humans are a species with strong pattern recognition capabilities. We can use uniformity to draw the eye away from insignificant things. This act of creating a uniform baseline allows for the differences to stand out.


In this case, the Trapped Negative Space breaks the pattern we are trying to express. As I have heard in many design classes, if you highlight everything, you have highlighted nothing.


If you highlight everything, you have highlighted nothing.


— Sharon Cave, Sharon Cave Fine Art™* 


If you see Negative Space, like with THE WALL OF TEXT, you have done something wrong and need to refine your work.


In this case, I noticed that the negative space was nicely checker-boarded: errors do not have meters, and successes have meters but do not have text. It was a simple step to collapse the interlocking space.


Next Steps 


This chart seems reasonable for expressing the ideas that we want to share with our customers. It is accommodating of diverse biological and digital users and concise enough to convey information quickly.


This is a good start.


This was only a wireframe and a quick sketch to help get a feel for the data that should be presented to users. For example, that header has a lot of Negative Space just screaming to be moved around, and the black grid is overly contrasting, drawing the eye away from the information.


Lecture Notes, Sharon Cave Fine Art
https://www.instagram.com/sharoncave6/


Visually, this chart needs a lot of work.

• Marketing design

  This has not even begun to integrate with corporate look and feel. A look at the overall design of the parent reports is necessary.


• Accessibility

  While it has helped to remain constrained to basic accessibility, specialists must look deeper into this.


• Translation

  There is some text, and elements of the text will need to be translated into supported languages.


• Peer review

  This is representative of a single day's work and has not seen a review from peers that may express concerns within our domain.


Having said that, the changes suggested by these various groups will address aesthetic reasoning. We have carefully minimized the overlap between the design and data concerns during the design process. The data is rendered as a simple table, with formatting controlled separately. This means that designers can drastically change the design without significantly changing the data produced.


Again, it's not perfect, but looking toward future cooperation is always necessary, and being open to their suggestions (or sometimes outright changes) can be a wonderful learning experience.

Animation


I remember reading an article by Mike Bostock describing the value of having sorted graph bars slide to their new position when they needed to be changed. Humans see motion and fixate on it, and watching a chart item change position helps us comprehend the change.


This chart (a horizontal bar chart) represents the exact scenario he was describing. We order items by error state (most significant at the top), and animating changes in that order would draw attention to a substantial change in state.
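
As a rough sketch of the ordering rule and of deciding which rows deserve an animation, assuming three state names and a severity ranking that are invented for this example:

```python
# Lower number sorts first; the ranking itself is an assumption.
SEVERITY = {"expired": 0, "refresh": 1, "valid": 2}

def ordered(rows):
    """Most significant state at the top, ties broken by name."""
    return sorted(rows, key=lambda r: (SEVERITY[r["state"]], r["name"]))

def moved(before, after):
    """Names whose position changed between two orderings (animation candidates)."""
    old = {r["name"]: i for i, r in enumerate(before)}
    return [r["name"] for i, r in enumerate(after) if old.get(r["name"]) != i]

before = ordered([
    {"name": "Gov SEC", "state": "valid"},
    {"name": "Market NYSE", "state": "valid"},
    {"name": "Gov SEDAR", "state": "expired"},
])
after = ordered([
    {"name": "Gov SEC", "state": "valid"},
    {"name": "Market NYSE", "state": "expired"},
    {"name": "Gov SEDAR", "state": "valid"},
])
changed = moved(before, after)
```

Only the rows listed in `changed` need to slide; rows that keep their position stay still, so the motion itself carries the message.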


[Figure: the Status chart, with rows for Gov SEC, Market NYSE, Gov SEDAR, and Market Cboe]


Summary 


We were asked, "How will users know the dataset is 'fit for use'?" I think we have achieved that, but we've gone much further:


Object Constancy, Mike Bostock,
https://bost.ocks.org/mike/constancy/#when-constancy-matters

• We have created an advertising list of all our


• Fit for Use is estimated but left to the user


• It is accommodating of various user needs, offering multiple success paths


• Information density is high but not overwhelming.


• The reduced text allows for it to be used multilingually


• Design and Logic have been separated as concerns for easy collaboration


Not bad for 4 hours after supper, and (frankly) a lot of fun.


I'm hopeful that this chart will make its way in front of our users; I think it will help make our service more visible to new users, offer a lot of information to our current users, and free up the team's time from many status reports.


Ironically, the day after I first presented this report (and wrote most of this post), a colleague gave a presentation on how we present information to users. In it, they emphasized the need to meet the viewers where they are and not complicate the problem. From the online audience, I muted my microphone and burst out laughing. A brief chat between us after the presentation summed up our shared perspective:


Keep it Simple, Stupid
Unfortunately, that's sometimes a complex thing to do.


See “Technologic (In)accessibility” for the importance of multiple paths to success.

Further Reading


There is so much great reading to be done on how we convey information to users.


The first stop for me was everything by Mike Bostock. Most famous for his work as a New York Times data visualist, he also invented D3 and Observable HQ. You should read everything you can by him.


Fundamentals of Data Visualization, by Claus Wilke, is a must-read for anyone interested in visualization and makes an excellent textbook for any classroom.


White Space Is Not Your Enemy, by Rebecca Hagen, is a good introduction and foundation into the visual arts from a marketing perspective. This supplies the right level of general theory to apply across several visual disciplines.


How to Lie with Statistics, by Darrell Huff. Huff focuses on how humans interpret (or misinterpret) numbers and how our expressions of those numbers can help, hurt, or misdirect understanding.


Most importantly, remember that User Experience is more than just rounding the borders of HTML. It's about understanding the psychology, anatomy, and physiology behind our interactions with our users, so make sure you spend time talking to them to understand their experience.

Error Message
Habituation


Being aware of and mitigating the risks of 
Habituation in System Maintenance and 
Design 


I will give multiple examples from my career demonstrating the dangers of error message fatigue and habituation leading to ignoring vital signals. Further, they will show how easy it is for humans to fall prey to habituation. Finally, I will conclude with specific techniques and modern tools that can reduce the frequency of its occurrence.

I have three cases rolling around in my head, one from 2004, one from 2012, and one from about three months ago, so it seems I'm doomed to relearn this lesson about once a decade.


As the newly promoted Principal Developer at my first company, I inherited a successful product that was suffering from growing pains. The previous Lead had been a graphic designer and had done a wonderful job of building a usable and popular interface, but some of the more engineering aspects had been forgotten along the way. This was reaching a point where it was hampering product growth.


I brought a new perspective on quality and reproducibility, turning the tool from a website into an administrative product. This change in focus between us meant I had years of engineering neglect to catch up on.


A few months later, it wasn't a surprise to me when, one day, the company's owner stormed into the development department angrily and started to rant about a severe defect where data was being dropped. It was a real edge case, and the scenario he was talking about was something I had seen myself, but only rarely and never reproducibly. Focused on known issues and bringing the system into a state capable of expanding, I had easily written it off as a ghost in the machine, thrown it on the bug list and pushed it way down. Unfortunately, a customer had seen it this time and could reproduce the issue reliably enough that it had become an embarrassment to the owner, so now it was at the top of the priorities list and became the focus of my exploration.

While I don't remember the exact defect, I remember following the defect into the Web Server's logs (Windows Event Viewer). Previously, my focus had been on the system's external user behaviour, but there were thousands upon thousands of lines of warnings... every minute in the logs. Warnings about uninitialized variables, unsafe typecasts, and... a little bit of everything. I had never really given much thought to it because it was so overwhelming as to be meaningless. Still, when I isolated interaction with the form in question and filtered the logs down to just that time frame, I could consistently see one warning that seemed to relate.


I'd found my needle in the haystack, but it wasn't an error. It was just a warning among thousands of warnings I had been ignoring. As I dug through the logs, I could see this message appeared regularly when that form was accessed (not always, but regularly). It went back to the very founding of the system (before I had even started my programming education; I was still a nurse). This error had been getting reported for almost a decade, and nobody had seen it for two reasons:


1. it was a warning, not an error
2. the signal had been lost in a sea of noise


I learned a valuable lesson that day. No Warnings, No Errors became my mantra.


Naturally, I first fixed the issue that had been spotted, but this defect had been signalled by a warning in the system logs; if someone had addressed it, we would have fixed it almost a decade before. So, I started to address all the warnings in the logs.
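
The system in this story logged to the Windows Event Viewer, so the following is only an analogy. In Python, the standard `warnings` module can promote every warning to an exception, which is one mechanical way to enforce a No Warnings, No Errors rule. The `legacy_total` function is a made-up stand-in for code that emits a warning.

```python
import warnings

def legacy_total(values):
    """Hypothetical function that still works but warns on every call."""
    warnings.warn("legacy_total is deprecated", DeprecationWarning)
    return sum(values)

def run_strict(func, *args):
    """Run func with every warning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # a warning now stops execution
        return func(*args)

try:
    run_strict(legacy_total, [1, 2, 3])
    outcome = "passed silently"
except DeprecationWarning as exc:
    outcome = f"stopped: {exc}"
```

The same effect is available for a whole run with the interpreter option `python -W error`. Either way, a warning can no longer accumulate quietly in a log nobody reads.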


Most were relatively benign, identifying (perhaps) that a variable had not been explicitly initialized before use. Still, since null was treated as a zero or empty string, it didn't impact the system's behaviour. But as I made minor corrections and the log volume reduced, some of the warnings started taking on more ominous tones. More system defects were identified (and corrected), and most significantly to me, an actual error started to present regularly... one that had been missed in the excessive volume in the logs.


So, what was the core of the lesson? 


Seeing a large volume of errors can make us insensitive to them. When we ignore significant messages, we train ourselves to not pay attention, and that's when bad things happen.


The term habituation is used in several related contexts, including medical, social, and psychological. Still, the general summary would be the loss of recognition of negative stimuli due to repeated (habitual) exposure to them.


We see this all around us and in our day-to-day lives. People get habituated to getting yelled at by a pet, becoming numb to the exposure. Physically, a carpenter may become desensitized to getting slivers, simply pulling them out at the end of the day rather than immediately flinching. I like a hot shower, but it usually takes a moment for my skin to get used to the hot water. It has [...] known to flinch when touched but will stop flinching with repeated touching.


This helps us get on with life.


Flinching is a critical reflexive reaction that protects us from bad things happening. Stubbing your toe should produce an immediate "protect your toe" response; getting an unexpected cut on your hand is dangerous, and I should jerk my hand back from scalding water. But sometimes the cut is minor and expected (slivers) and, for the most part, just part of the job. Life has to go on. A hot shower is a significant temperature change, but it isn't harmful and is rather pleasant once I get used to it. The process of habituation allows us to maintain our high-alert state while at the same time learning to moderate it under various conditions.


Humans are biologically cued to become habituated. It is part of our survival strategy as a species. It's built into you.


Therefore, you cannot ignore the risk of habituation to our systems.


Receiving an error signal (error messages, warnings, failing tests) regularly, evaluating it as safe to ignore, and not taking action psychologically prepares you to ignore it later. It begins to habituate you to the error signal, placing it on a pile of things we can ignore.


Common Examples of 
Programmer Error 
Habituation 


I have found examples of error habituation at every organization I have ever worked at and in every role I have filled. They do not always present the same way, but they are pervasive throughout the industry, even presenting themselves as Best Practices to the untrained eye.

Errors and Warnings


This is the most obvious since it's right in the name, which makes it a good place to start.


• Compiler warnings
• System log warnings
• Pager notifications


This was my first exposure to this. We learn through practice at school that compiler errors prevent us from submitting our assignments, but warnings do not. With the short intensity of student life, ignoring warnings becomes a habitual survival strategy. As we become mature professionals, we learn that these messages were put in place to convey meaning to us and offer us protection.


Known Software Defects


Defects in software are discovered, and discovering and correcting them is the art form. In the words of Robert Glass:


43. Maintenance is a solution, not a problem


— Facts and Fallacies of Software Engineering” 


Facts and Fallacies of Software Engineering, Robert Glass
https://www.amazon.ca/Facts-Fallacies-Software-Engineering-Robert/dp/0321117425

It's important to realize that defects must be expected and worked on. At the same time, there is always more work than time, so some form of prioritization is necessary. This means we must ignore them for a while (even if it is just the time it takes to fix them).


The problem is that the more defects we acknowledge are present, the more we tend to ignore them as irrelevant. The more we defer fixing bugs, the more we get into the habit of deferring bug fixes.


TODO and Change Comments 


TODO comments within the code were a way to identify an item that needs to be addressed: we'll come back to this later.


There is a strong likelihood that we are ignoring the problem because we are busy with something else. Certainly, it is impossible to split ourselves into two to address both issues simultaneously, so we note the secondary problem while we address the primary one. That makes sense.


The problem arises when we don't come back to it.


Accumulating TODO notes through code can become 
excessive noise, causing us to start ignoring the message. 
Further, as these are usually listed along with the 
‘warnings and errors, they represent noise that drowns out 
more important signals. 


Do not become habituated to seeing useless comments. A header in an individual file containing a list of every change ever made to the file is a typical pattern that has become an anti-pattern. The purpose of these comments is to offer a log of changes that have been made to the code in their contextual place.


Unfortunately, this (good) habit started decades ago with different coding styles. The pattern assumes that a single file is self-contained to all its changes and does not interact with other entities (since recognized as a faulty assumption). There is also the problem that decades of messages accumulating at the start of the file means an impenetrable WALL OF TEXT must be scrolled past before anything meaningful can begin. This immediate scroll past habituates us to perceive large blocks of comments as meaningless when we should consider large explanations in the code something important and meaningful.


What started as a good idea for small files over short time frames has evolved into a bad idea with better alternatives.


Relearning the Lesson 
(Twice) 


A decade later, I found myself on contract with a major 
corporation that had terminated its previous contracting 
company due to poor quality performance. My team had 
been hired to not only deliver but also do it with an eye to 
quality. 


On my first day reading the regression test suite, I naturally glanced at the warning list to see how many warnings were in the code. I immediately found myself staring at a list of hundreds of warnings and thousands of TODO messages. Naturally, I tried to ignore them... they were things that needed to be done in the future, not immediately. However, as I cleared the significant backlog of warnings, I started to come across the TODOs' locations.

It was horrifying.


    @Test
    public void ReallyImportantThing() {
        // TODO: implement this
        Assert.assertTrue(true);
    }


In many (most) cases, the note suggested a person should implement the test for real. (There is a similar story, I thought, told by Spolsky, of a notorious function implementation in MS Office... the same thing.)


In my case, I suspect the previous team, under pressure to perform and deliver, had been masking gaps for a long time. Many regression tests were simple stubs that returned a success no matter what. This allowed them to claim the job was done while promising themselves they would fix it... later... when they had time. That time never came.

It was difficult to explain to the client that I was taking a week to reevaluate how much testing was actually being performed. When I reduced their test count by more than half, I needed to remind them they had hired us specifically because they knew there had been a problem. Identifying those problems and giving honest assessments is where our value comes from.
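
A crude scan can give a first estimate of how many tests are stubs like the one above. This sketch is not how the audit in the story was done; it simply searches source text for two tell-tale patterns, and both the patterns and the function name are assumptions.

```python
import re

# Patterns that suggest a placeholder rather than a real test.
STUB_PATTERNS = [
    re.compile(r"assertTrue\(\s*true\s*\)"),
    re.compile(r"//\s*TODO", re.IGNORECASE),
]

def find_stub_lines(source):
    """Return (line_number, text) pairs that look like placeholder test code."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STUB_PATTERNS):
            hits.append((number, line.strip()))
    return hits

sample = """\
@Test
public void ReallyImportantThing() {
    // TODO: implement this
    Assert.assertTrue(true);
}
"""
stubs = find_stub_lines(sample)
```

A text search like this will miss cleverer stubs and may flag legitimate code, so its output is a list of places to read, not a verdict.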


TODOs were added to my list of things not permitted in code bases I was involved in.


No Warnings, No Errors

Another decade has passed, and recently (weeks), I had to catch myself again.


I implemented a basic continuous monitoring alert system on a new system we are working on. It periodically scans the system for invalid states and immediately notifies the team of the bad state (OK, it informs me and one other, and we notify the larger team... baby steps). If an alert is issued, we must act to save the system.


I ignored a message.


In this case, the alert was to notify us that we had stopped receiving signals from a remote source, and I had ignored it. As a batch process, it is not uncommon for the source process to take longer than anticipated. This isn't a big deal since usually, it delivers shortly after we check, and we just pick it up on the next pass.


Except it is a big deal because I ignored it.


My colleague, just returning from vacation, called me and asked if I had noticed that the system was erroring; she didn't see a ticket and wondered if I was dealing with it. I told her it was no big deal, that one fails regularly... and as the words came out of my mouth, I heard what I had just said.


Sure enough, we looked closer, and the failure had occurred for three cycles; the source was not transmitting data, and I, through habituation, ignored the failure.


Preventing Habituation

There is really only one solution to preventing error habituation: address every error, warning, or notice and treat it immediately and with the utmost priority.

While I say this, there are some subtleties to how we achieve this:


1. Always fix defects before implementing new 
features 


2. Never ignore a defect message 


Various tools and mentalities are available to help us address this.


Errors that can be ignored 


There is no such thing as a reported error that can be ignored. Either:


1. the system is in an invalid state and needs to be fixed immediately or


2. the error notification system is flawed and needs to be fixed immediately


An incorrect notification could come from a log monitor that alerts when an invalid state is encountered. Upon inspection, the state is determined to be undesirable (not invalid). “We'll ignore the error; it will correct itself later.”


NO! Change the log monitor to take into account the new information. It may need to run less frequently and count how long the error state exists (waiting before alerting), but whatever gave you a reason to think it can be ignored needs to be incorporated into the official rules for alerting.
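
One way to fold “it usually corrects itself” into the alerting rule is to count consecutive bad checks and alert only once the count reaches an agreed threshold. The class name and the threshold of three are assumptions for this sketch.

```python
class DebouncedAlert:
    """Alert only after the bad state has persisted for `threshold` checks."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive = 0

    def observe(self, state_is_bad):
        """Record one check; return True when an alert should fire."""
        if not state_is_bad:
            self.consecutive = 0  # a good check resets the count
            return False
        self.consecutive += 1
        return self.consecutive >= self.threshold

alert = DebouncedAlert(threshold=3)
checks = [True, True, False, True, True, True, True]
fired = [alert.observe(bad) for bad in checks]
```

With the tolerance written into the rule, a late batch that arrives on the next pass never raises an alert, and any alert that does fire means the agreed limit was exceeded and must be acted on.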


Failing Tests 


As previously mentioned, you can't be working on two problems in two places simultaneously. One problem must be set aside while you focus on the other. Unfortunately, this leads to ignoring errors, which becomes habitual.


To avoid this, the first step is to create a task in your backlog, which immediately gives us a record of the issue. Secondly, we should immediately generate an automated test that can allow us to reproduce the error. The problem is that the test will be failing, constantly reporting an error to us. This is a failure signal that we want to ignore until we get the defect fixed, so we immediately mark it as SKIP, an ignore status.


This is a problem.


We can resolve this by having a team rule that all SKIP tests MUST have a ticket number associated with them and addressing every skip during every planning meeting.

For me, this often takes the form of reporting skips without ticket numbers as fails, and fails must be addressed immediately. Skips with a ticket link directly to their ticket in their reporting.
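
The reporting rule can be sketched as a small mapping step applied to raw results before they are displayed. The ticket format (`PROJ-123`) and both function names are hypothetical; a real test framework would supply the status and skip reason.

```python
import re

# Hypothetical ticket format such as PROJ-123; adjust to the tracker in use.
TICKET = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def report_status(status, reason=""):
    """Map a raw test status to the status shown in the report."""
    if status == "skip" and not TICKET.search(reason or ""):
        return "fail"  # a skip nobody is tracking is treated as a failure
    return status

def ticket_for(reason):
    """Return the ticket id mentioned in a skip reason, if any."""
    match = TICKET.search(reason or "")
    return match.group(0) if match else None
```

For instance, `report_status("skip", "waiting on PROJ-123")` stays a skip and can be linked to its ticket, while `report_status("skip", "flaky, fix later")` is reported as a fail.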


Lastly, the team MAY implement a zero defects policy. This is an agreement with the business that defects will be fixed before new features are implemented.


RFC-2119, Key words for use in RFCs to Indicate Requirement Levels. Defines the interpretation of keywords in engineering specifications.
https://www.rfc-editor.org/rfc/rfc2119.html

Zero Defect Mentality: History and Steps to Zero Defects Manufacturing, Renaud Anjoran, 2021-07-09
https://www.cmc-consultants.com/blog/zero-defect-mentality-implementations-and-history

Zero Defects Philosophy in Software Development Environment, Agile Development
https://www.agiledevelopment.org/agile-talk/134-zero-defects-in-software-development

NOTE


I have never been satisfied with the reports generated by testing systems, and I have always written custom visualizations to track defects. This has had the side effect of my introducing concepts like adding extra states to TestNG's default reporting (known, manual, feature) with active links to the repository and issue tracking software.


Comments Calling for Action 


TODO comments were a classic way to mark something you need to come back to and finish off, and they still have their place, but fundamentally, they are a call to ignore the problem (but only for now).


‘The problem here is the same as with all the others; we 
need a way to prevent it from becoming forever. 


A straightforward way of handling this is to put a Version Control hook in your repository that prevents check-ins of TODO comments. Generally, you should only put this on protected branches. This allows you to put them in your code to enable you to continue working but prevents you from submitting them to the official branch by accident, forcing you to finish the job you planned on doing. If you can't get to a TODO, don't leave it in the code; register it in the backlog as something that still needs doing. This leaves the alert list available for warnings and errors so they don't get hidden.
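
A minimal sketch of such a check, assuming Git: it inspects only the lines being added, so existing TODOs do not block unrelated work. The function names are invented here, and how the check is wired in (a local hook script or a server-side rule on the protected branch) depends on your hosting setup.

```python
import re
import subprocess

TODO = re.compile(r"\bTODO\b")

def added_todo_lines(diff_text):
    """Return (file, text) for every added line in a unified diff that contains TODO."""
    hits = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            current_file = path[2:] if path.startswith("b/") else path
        elif line.startswith("+") and TODO.search(line):
            hits.append((current_file, line[1:].strip()))
    return hits

def main():
    """Entry point for a hook script: a non-zero return value blocks the commit."""
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits = added_todo_lines(diff)
    for path, text in hits:
        print(f"{path}: {text}")
    return 1 if hits else 0

sample_diff = """\
--- a/src/Report.java
+++ b/src/Report.java
@@ -10,0 +11,2 @@
+    // TODO: handle empty datasets
+    int count = rows.size();
-    // TODO: old note being removed
"""
found = added_todo_lines(sample_diff)
```

In an actual hook file, the script would end with `raise SystemExit(main())`. Removed lines are ignored on purpose: deleting a TODO is exactly the behaviour the rule is meant to encourage.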


Those massive headers at the beginning of the code only work to mask issues. They get in the way of text searches and require a lot of visual space to scroll past. All that is for something that is based on an old paradigm: changes are constrained to a single file.


Modern VCS tools assume that a change may require the context of several locations in code to be meaningful and have logging built into them. Keep your changes in the Change System database, reducing the visual noise by placing them in a contextual list that is hidden until you need them. When you need them, the list is optimally indexed for what it is.


Conclusion 


As humans, we make mistakes. Each decision we make is a totally new decision that we must make, injecting the opportunity for error. This opportunity for error can be compounded by biases introduced from our experience. Habituation of errors represents a biasing of our behaviour that we are biologically predisposed toward and can be dangerous to our work.


It is important that we, as professionals, work to overcome these dangerous biases through constant diligence and self-appraisal.


As software developers, our work captures 
decision-making before the stimulus and action, and 
defects can have catastrophic effects. Teaching ourselves 
to ignore benign errors can mask more catastrophic issues 
that have a significant impact on people's lives: 


• Aeroplanes fall out of the sky

• [...] and imprisoned


Further Reading 


I hope I've made the case that it is easy to teach ourselves to ignore errors because we are humans, and humans are fallible. Addressing this is hard but not new.


Facts and Fallacies of Software Engineering (Robert L. Glass) is a great read that opened my eyes to how common these issues are and how we all want things to be true, even when they aren't. I keep a copy of the table of contents in a text file, just so I can search it regularly.


Downfall: The Case against Boeing (Netflix) discusses an important event in computing history. Remember that in 1969, Software saved an aircraft with a bad attitude sensor, while in 2018, Software killed 318 people due to a bad attitude sensor.


Any YouTube video on Aircraft crash investigations: observe how multiple people must ignore warning signs for a long time for a problem to occur. Note how easy it is for dangerous behaviour to become habituated.


Consider reading the manuals of your favourite tool suite to get a better understanding of why the software was developed, how it is meant to help you, and how it can replace some practices you may have thought were a good idea.


Project Management tools can help prioritize and track outstanding issues.


Test Suites can help you identify errors.


Version Control systems can help you understand why historical changes were made, offering a significant amount of context when you need it. I actually recommend reading SVN's manual as it brought a substantial paradigm shift at the time it was introduced that needed to be explained (Use Git, read about SVN).


and always pay attention to your own emotions and biases... your mistakes are always available to learn from.

Maintaining Privacy in
Data Shares


A Guide to Big Data Privacy for Dummy 
Developers 


• Balancing the power of modern computational
capabilities with individual privacy is a formidable
challenge in data sharing.


• Benevolent actors, in their quest for insights, may
inadvertently breach privacy by applying multiple
dimensions to data and exposing individuals.


• Introducing a privacy metric to mechanically
measure the risk associated with shared datasets
aids decision-making for responsible data
sharing.


When we delve into the intricacies of data sharing, one
paramount consideration arises: privacy. The very
essence of privacy hinges on how informative the shared
data is, necessitating a quantifiable approach.


In today's information-gathering landscape, the sheer
capacity often leaves one in awe. Privacy and data ethics,
perennially debated topics, trace their roots back more than
230 years to the US Constitution's Fourth Amendment
(ratified in 1791), which asserts the right to be secure in
personal papers against unreasonable searches and seizures.
This is just one significant example of recognizing the
perils of exposing personal information to authorities.


A revelation in 2013 confirmed what had long been suspected
and brought the information capabilities of governmental
organizations, such as the NSA, to the forefront. The
disclosure that the NSA was collecting phone call records in
bulk was not just mind-boggling; it raised
significant questions about privacy and data ethics. The
sheer scale of data collection raised concerns about the
potential for misuse and the safeguards in place.


We're not talking about datasets reminiscent of the Access
databases of the 1990s for mailing lists; these are colossal
databases maintained by governments and
mega-corporations. They harbour the potential to craft
exact profiles of individuals. When amalgamated, the
sheer volume of data generated by individuals provides a
unique perspective on the individual.


"K6anPpQOndxPFz0tgikKaw==". This value is generated
from my browser settings. It allows my online activity to
be uniquely tracked across multiple websites and without
the assistance of cookies. What's your fingerprint?
https://fingerprintjs.github.io/fingerprintjs/


How do we balance the power of modern computational
capabilities with the privacy of the individuals whose data
we collect?


The Road to Hell is Paved 
with Good Intentions 


There are many layers to security, but often, the first that
should be applied is to guard against misuse, where legitimate
users use the data for inappropriate purposes.


In movies, a private investigator will buy information
from an informant; super-spies are seen breaking into
vaults to steal information about an individual. These are
not complete fabrications; I have been involved in several
investigations regarding personal information leaks. I
have been involved in at least two cases of data theft (both
as a Nurse and as a Data Professional); in both cases, a
private investigator hired an insider to look up
information in the system. In both cases, gathering
evidence to trace the activity was as simple as looking up
data access logs that did not align with business duties but
did align with the suspected misuse. The system's
mechanics were sufficient to restrict, identify, and enforce
access to data.


Data Warehouses and modern analytics add a wrinkle to
this problem.


Data Warehouses can be compelling information
resources. Depending on your business line, they will
contain all the joined data and personal information of
customers and employees across multiple business lines.
For governmental organizations like the NSA, the
customer data points are the citizenry and visitors to the
country. This is a potent tool for analysis; however, it
carries risks of misuse.


Mechanically, the nature of the bulk data that is being
shared makes it vulnerable to misuse by benign,
malicious, and (most importantly) benevolent actors.


• Keeping malicious actors out is obvious: do
security background checks to find honest people,
hire honest people, and create guidelines for use
that honest people can follow.


• Benign actors are... well... benign. They are the
honest people you hire and are happy to follow
corporate policies like: You must not look up
yourself, your friends, or your family.


• Benevolent actors are more complicated.


If we have screened for honest people, we have biased our
search toward helpful people. Combine that with large datasets
and the fact that data analysts are curious people, and we
have a recipe for disaster.


The benefit is derived from performing aggregate analysis
on many detailed values. If an organization has sensitive
data, but an honest analyst can offer some significant
insight by inspecting it, they will make the data available
to the Analyst. This generally takes the form of the Analyst
proposing their study and describing their data needs, and
the organization then sending them the data that aligns
with their request.


...and this is where it starts to fall apart.


A reasonable request for data may exist within the United
States Census Bureau to create a report identifying a
gender wage gap.


A United States Census Bureau data visualization of the gender
wage gap by US state, available online


To support the request, the analyst is given access to a
dataset containing all tax filings for the past dozen years.
The analyst then loads the data into the analysis tool of
their choice, does a simple aggregation by state and sees
their results.


What Is the Gender Wage Gap in Your State?, United States
Census Bureau, Megan Wisniewski, 2022-03-01. Contains
an excellent example of a valuable analysis.
https://www.census.gov/library/stories/2022/03/what-is-the-gender-wage-gap-in-your-state.html


-- average income by state and year, overall and per gender
select
    state,
    year,
    -- nullif avoids a divide-by-zero when a group has no rows for a gender
    sum(case when gender = 'M' then income else 0 end)
      / nullif(sum(case when gender = 'M' then 1 else 0 end), 0)
      as avgM,
    sum(case when gender = 'W' then income else 0 end)
      / nullif(sum(case when gender = 'W' then 1 else 0 end), 0)
      as avgW,
    sum(income) / count(*) as avg
from
    dataset
group by
    state, year


Their manager approves hitting the publish button, and
the team heads for lunch together.


Over lunch, the original author is discussing their findings
with their colleagues when someone asks a simple
question:


I wonder if age has anything to do with that?


That's an interesting question and may be useful to add to
analysis reports. So the Analyst goes back to their
favourite tool. The data is still in the analysis tool, and all
they have to do is adjust the parameters of the query.

This is where informational security starts to break down.
This is not what the data was authorized to be used for.


While this is a benign example, each data dimension
brings us closer to revealing the individual. In their wish to
discover meaningful insight, the Analyst has applied two
dimensions to the individual in question. While this may
not be a big deal at the state level, imagine applying filters
like this to the town of Albert, with its small population.
Suddenly, using age may be unique enough to
distinguish some individuals, and publishing their wages
to their neighbours could cause bad blood in town.
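The shrinking effect of each added dimension can be sketched in a few lines. This is only an illustration: the records, the field names, and the `smallestGroup` helper are all made up for the example and do not come from any real dataset.

```javascript
// Hypothetical records; every value here is invented for illustration.
const people = [
  { state: "TX", gender: "M", age: 34, income: 52000 },
  { state: "TX", gender: "M", age: 34, income: 61000 },
  { state: "TX", gender: "W", age: 34, income: 58000 },
  { state: "TX", gender: "W", age: 71, income: 43000 },
  { state: "TX", gender: "M", age: 52, income: 75000 },
  { state: "TX", gender: "W", age: 52, income: 69000 },
];

// Size of the smallest group produced by grouping on the given fields.
function smallestGroup(rows, fields) {
  const counts = new Map();
  for (const row of rows) {
    const key = fields.map((f) => row[f]).join("|");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Math.min(...counts.values());
}

const byState = smallestGroup(people, ["state"]);
const byStateGender = smallestGroup(people, ["state", "gender"]);
const byStateGenderAge = smallestGroup(people, ["state", "gender", "age"]);
console.log(byState, byStateGender, byStateGenderAge); // 6 3 1
```

With these invented rows, the smallest group goes from six people to three to one. A group of one is no longer an aggregate; it is a person.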


The security (privacy) of customer data is of paramount
importance. It requires measuring the information about
the individual we are sharing. This concept of revealing the
individual is effectively that of information entropy. Entropy
is a measure of the amount of surprise; in this case, it is
the amount of surprise we have when we discover the
actual person.


Aside from purely mechanical safeguards like network
access controls, encryption, and permissions, we need to
consider the possibility of shared data being misused. The
ultimate security tool is to simply not share data we do not
want people to have access to or (more importantly) to
control the context in which the data is interpreted. This
allows us to maintain a high level of entropy around the
individual.


Whenever someone asks for access, we must evaluate
whether we are exposing enough information to expose
the individual.


This requires human thought and analysis, and humans
make mistakes. Overworked, over-tired, or pressured by
the office bully, a human may give permission to expose
more data than is appropriate, allowing researchers to dox
an individual accidentally. Some mechanism for
mechanically, automatically, and unbiasedly measuring
the privacy risk of a proposed dataset is necessary
because...


Hell truly is paved with good intentions.


The Problem


To best understand the problem, let us consider a
simplified example.


By studying this dataset, we can see that it is a basic
income data set for US citizens. It includes some
demographic information like their name and gender, as
well as some contact information. While it does have a
random identifier, it also contains their Social Security
Number, something we do not want to just hand out to the
first Private Investigator who asks for it.


Our goal is to give information to customers (analysts)
who request it and ensure we do not provide so much that
we expose the individual. We can define this as:


the information given to the analyst must not
be sufficient to identify a single individual


The very first step is to remove the account identifiers. We
don't want those handed out, but the rest of the data is
unclear. What data does the Analyst need to satisfy their
research needs?


Our customer is studying income, so we must include that,
but the concern would be that it gets exposed. While the
individual might recognize their income, assuming
they've kept that private, it should stay private and not be
associated with them.


Let's start by considering the email address. Email
addresses are designed to be unique to the individual. If we
give our customers the dataset with only an email address
and an income, we have effectively tied the income to that
person. On the other hand, exposing the country does not
tell us anything unique about the individual (everyone in
our dataset is from the USA). So, if we are going to share
data, we want to share data with
minimal impact.
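The contrast between email and country can be made concrete with a small sketch. The rows and the `uniqueShare` helper below are assumptions made up for illustration; the idea is simply to count how many rows carry a value that nobody else shares.

```javascript
// Hypothetical rows; names and values are invented for this sketch.
const customers = [
  { email: "a@example.com", country: "USA", income: 52000 },
  { email: "b@example.com", country: "USA", income: 61000 },
  { email: "c@example.com", country: "USA", income: 58000 },
  { email: "d@example.com", country: "USA", income: 43000 },
];

// Fraction of rows that carry a value shared with no other row.
function uniqueShare(rows, field) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[field], (counts.get(row[field]) || 0) + 1);
  }
  const singles = rows.filter((row) => counts.get(row[field]) === 1);
  return singles.length / rows.length;
}

const emailShare = uniqueShare(customers, "email");     // every email is unique
const countryShare = uniqueShare(customers, "country"); // everyone shares one value
console.log(emailShare, countryShare); // 1 0
```

A share of 1 means the field pins down every row on its own; a share of 0 means the field, taken alone, singles out nobody.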


Our Analyst studying gender inequality (Alice) is doing
pretty well, having only asked for gender and income.


Another analyst, a couple of desks over (Bob), has been
working on an algorithm for a while and thinks he can use
Last Name as a proxy for race. He would like to do a study
using names and incomes. The Privacy Analyst, on the ball,
notices that full names are perfectly unique, so he offers a
list containing only last names.


Alice and Bob are discussing their findings over lunch one
day when their co-worker Eve overhears them. Eve's ears
perk up because she is also a Private Investigator who is
always looking for exciting datasets, and she surreptitiously
acquires the two datasets.


Eve knows she has acquired something that looks useless: the
two datasets have been vetted to ensure that the individuals
involved cannot be uniquely identified. However, with
closer inspection, she notes that the incomes are unique.
Using this insight, she joins the two tables.
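Eve's join takes very little effort. The sketch below assumes two tiny, invented extracts (`aliceData` and `bobData`) and a made-up `titles` lookup for the salutation; it is an illustration of the technique, not a reconstruction of any real dataset.

```javascript
// Hypothetical extracts; all names and incomes are invented.
const aliceData = [
  { gender: "W", income: 58125 },
  { gender: "M", income: 61310 },
  { gender: "M", income: 47250 },
];
const bobData = [
  { lastName: "Nguyen", income: 61310 },
  { lastName: "Okafor", income: 47250 },
  { lastName: "Larsen", income: 58125 },
];

// Assumed mapping from gender code to a likely salutation.
const titles = { W: "Ms.", M: "Mr." };

// Index one extract by income, then look up each row of the other.
const byIncome = new Map(bobData.map((row) => [row.income, row]));
const joined = aliceData
  .filter((row) => byIncome.has(row.income))
  .map((row) => ({
    salutation: `${titles[row.gender]} ${byIncome.get(row.income).lastName}`,
    income: row.income,
  }));
console.log(joined);
```

Neither extract contains a salutation, yet the joined result does. The unique income acts as an accidental primary key.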


Eve has achieved an interesting effect. She has not only
joined the two original datasets to get a more complete
picture, but she has also constructed likely salutations that
are new data that exceed the scope of either of the original
datasets. While it is not a complete picture, she has started
to build a profile on individuals. These profiles can be used
for purposes that exceed the original permitted use of the
data.


A Real Risk 


While this story is obviously made up, it is not unrealistic.
If one looks at the way we package modern reports, one
can see that these risks are present everywhere.


Modern data presentations demand some level of
interactivity. The ability to filter, change, and compare the
data on the fly is a powerful and compelling tool. But there
is a risk. To share these visualizations and make them
dynamic, we have to ensure sufficient detail is embedded
in the data we are sharing.


That data is in the file; just because you don't know how to
extract it doesn't mean nobody does. For the tool to be
useful, some people do know how... that's how they
make the visualizations work.


Like Alice and Bob, we act with the best intentions but
easily become Eve, sharing data inappropriately if we
aren't careful. We share data with too little entropy and
hide it behind a mask.


What Can We Do? 


For people concerned with individual privacy, this is bad.
So the question becomes, as custodians, how can we
prevent it?


While human thought and analysis will always be
necessary, they are subject to bias, inconsistency, and
mistakes. Automating or offering automated
decision-making aids to humans is always a good idea. In
light of this need, can we develop a metric to measure the
level of privacy in our shared datasets? Can this metric aid
in the decision-making process around data share
approvals?


How to read in tableau (.twbx) file into python,
Stack Overflow. Describes how to open and read a Tableau
file with no specialty tools. Like so many file types, it's just
a zip file with the data in a plain text file.
https://stackoverflow.com/questions/48634674/how-to-read-in-tableau-twbx-file-into-python


Increase the Entropy


There are a few normal practices we can follow to increase
the entropy of the data. In our example, the two innocent
datasets were joined using unique income.


Initially, income was allowed because it was necessary, but 
that created the vulnerability. We can increase the entropy 
of the field by rounding it to some level. 


Do we need it to be accurate to the dollar? What about the
nearest thousand dollars?


In doing so, we increase the number of people who will
match the value and protect their anonymity.
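A rough sketch of the rounding idea follows. The incomes, the step of 1000, and the helper names `distinct` and `roundTo` are assumptions chosen for illustration.

```javascript
// Hypothetical incomes, invented for illustration.
const incomes = [58125, 58490, 61310, 60875, 47250, 46980];

// Number of distinct values, i.e. how many join keys the field offers.
const distinct = (values) => new Set(values).size;

// Round to the nearest multiple of `step` (for example 1000 dollars).
const roundTo = (value, step) => Math.round(value / step) * step;

const exact = distinct(incomes);
const rounded = distinct(incomes.map((v) => roundTo(v, 1000)));
console.log(exact, rounded); // 6 3
```

Six unique incomes collapse into three shared values, so each value now matches two people instead of one. The analysis loses a little precision, and a join like Eve's loses its key.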


Measure the Entropy 


If we can measure the entropy, we should. Rather than
leaving it to custodians to use their best judgment, we can
offer them a way to objectively measure the state.


This metric can be made visible to both requesters and
approvers to help them decide the appropriateness of the
request. We can estimate the entropy of the request before
it is even approved, allowing us to keep safety at the
forefront of our minds. Later, we can measure the entropy
of the request to ensure it is sufficiently anonymized
before releasing it to the public.


Create Versatile Environments 


Don't dictate; cooperate.


One of the risks mentioned early in our example was the
Analyst holding on to the data. We produce the safe
dataset and then make it available to the Analyst for
loading into their tool of choice.


This is a common story: we have powerful computers and
tools we've trained on. These tools have power, but they
carry the risk of moving data off of the controlled
environments.


By creating powerful and versatile environments, we allow
customers to do their analysis in a controlled
environment. We should also create environments capable
of accommodating the diverse needs of experts.


This passively discourages requests to take the data
off-system by giving them access to the tools they want.


Further Reading 


C. E. Shannon, A Mathematical Theory of
Communication, 1948. Shannon created
the idea of informational entropy and
developed a way to measure it.


How To Quantify
Privacy Protection in
Shared Datasets


An Entropy-Based Approach for Objective
Evaluation and Automated Approval


We will discuss tools for estimating and automating
privacy enforcement in datasets.


• Objective Data Privacy Evaluation: Discuss a
straightforward approach using objective measures
to gauge and improve data privacy in shared
datasets.


• Estimating Data Privacy Levels: Learn through
real-world examples how estimating data privacy
levels can be a powerful tool, minimizing bias and
enhancing informed decision-making.


• Automated Checks for Enhanced Workflow:
Explore the application of automated checks in the
approval workflow to simplify processes and boost
productivity in safeguarding sensitive information.


A working example is available on
ObservableHQ, where you can see the
calculations for yourself and try
them out.


In the previous chapter, we discussed the importance and
challenges of preserving privacy, underscoring the critical
nature of cautious information sharing. This principle
extends beyond safeguarding personal identities to
shielding covert subjects like structures, military units,
and other sensitive entities. The guiding rule is clear: the
less revealed, the better.


Implementing this principle poses a formidable challenge.
Over-sharing risks exposing individuals and jeopardizes
the confidentiality of various subjects. Assessing the
privacy of shared datasets demands meticulous effort;
analysts invest time scrutinizing datasets to
ensure the concealment of both personal and classified
subjects.


However, relying solely on subjective analysis has its
inherent risks. Analysts, being human, are susceptible to
biases, fatigue, and external pressures, occasionally
leading to lapses in judgment. Despite these challenges,
there's a growing need to share data for diverse benefits.


The predicament endures: how do we establish a threshold
for sharing information without compromising the
privacy of individuals or the secrecy of sensitive subjects?
How can we alleviate the burden of the privacy evaluators
while simultaneously ensuring that shared data doesn't
pose excessive risks?


Data Privacy Handbook, Utrecht University
https://utrechtuniversity.github.io/dataprivacyhandbook/research-scenarios.html

Data Privacy Handbook, Utrecht University
https://utrechtuniversity.github.io/dataprivacyhandbook/faq.html


Enter Claude Shannon's Information Entropy concept,
originating from A Mathematical Theory of
Communication in 1948. Shannon's concept of the
smallest piece of indivisible information provides a
quantifiable measure of data. This measure is not only
applicable to personal privacy but also to the
confidentiality of secret subjects. It furnishes an objective
metric to estimate the privacy risk associated with
exposing a dataset, serving as a valuable tool for analysts
and automated systems in assessing risks related to
personal and classified information.


These concepts present a practical tool for striking a
balance between safeguarding subjects' privacy and the
imperative to share data for analysis.


Definitions 


To ensure precision and clarity, I have substituted the
term "individual" with the more inclusive "subject".
While my primary focus revolves around protecting
people's privacy, it's crucial to recognize that some fields
manage data of a covert nature that extends beyond
human entities. Examples include tight holes in the oil
and gas sector, covert police and military locations, and
discreetly insured and transported tangible objects.


Energy Glossary, SLB. Tight hole: a well that the operator
requires be kept as secret as possible, especially the
geologic information. Exploration wells, especially rank
wildcats, are often designated as tight.
https://glossary.slb.com/en/Terms/t/tight_hole.aspx

New details about $20M Toronto airport gold heist
revealed in Brink's suit against Air Canada, National Post,
2023-10-10
https://nationalpost.com/news/toronto/new-details-toronto-pearson-airport-gold-heist


Remarkably, the strategies for privacy protection can be
universally applied, regardless of the nature of the subject.


Central to our exploration is the concept of privacy, which
we define as the resistance to deducing the individual's
identity. Think of it like the classic game of Guess Who?,
where your opponent uses provided data to guess the
subject's identity and then uses a broader dataset to gather
more information. Privacy erodes when data
boosts our confidence in identifying individuals, while
increased privacy comes from a lower confidence level in
distinguishing one individual from another.


It's crucial to emphasize that privacy isn't only preserved
by obscuring a person's name or ID. Even with fabricated
labels for the subject, comprehensive knowledge about
them can still lead to privacy breaches. As an analogy,
personal privacy can be compromised even if you know
everything about a subject but refer to them by a
pseudonym, much like my limited knowledge about my
neighbour, whom I simply recognize as "that lady next
door".


There are three significant roles in any informational
message transfer: the sender, the receiver, and a potential
interceptor. In the case of a shared dataset, we find these
three actors present. Privacy Analysts sit between our
source data and the recipient and filter it, acting like a
sender, sending a sanitized message out. The intended
recipient is a Data Analyst, whether an internal colleague
or a member of the public. Lastly, the interceptor is a
malicious actor, who may even be the Data Analyst,
either gaining access to the data nefariously or, more
pertinently, using a permitted dataset in a nefarious way.


Guess Who Board Game
https://www.amazon.ca/Original-Guessing-Double-sided-Character-Families/dp/B09WX9KT3S/


While entropy is often linked with the predictability of
physical systems, it can be better understood as a measure
of chaos. Information Entropy, similar to the concept of
chaos, quantifies the level of surprise within a system. In a
stable or predictable state, a system exhibits low entropy.
Consequently, possessing complete knowledge about a
subject leaves minimal room for surprise; conversely,
knowing nothing allows for unexpected discoveries.


While Claude Shannon is renowned for the bit, several
names and related measures apply to the same concept of
information entropy. The shannon measure represents the
smallest, indivisible unit of information, a binary state
(true or false) synonymous with a bit (or binary digit).
This concept builds on previous work defining the hartley
(Ralph Hartley, 1928), which uses a decimal base and can
also be referred to as the dit, among other terms used by
Turing, Good, and others.
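The two units differ only in the base of the logarithm. As a small sketch, assuming we describe an event by its probability `p`, the helper names `shannons` and `hartleys` below are my own labels for the standard formula for information content, -log(p):

```javascript
// Information content ("surprise") of an event with probability p,
// expressed in two units that differ only by the logarithm base.
const shannons = (p) => -Math.log2(p);  // base 2, also called bits
const hartleys = (p) => -Math.log10(p); // base 10, also called dits

const coinFlip = shannons(0.5);           // one fair yes/no answer
const oneInThousand = hartleys(1 / 1000); // three decimal digits of surprise
const ratio = shannons(0.001) / hartleys(0.001); // log2(10), about 3.32
console.log(coinFlip, oneInThousand, ratio);
```

One hartley is therefore worth a little over three shannons, which is why either unit can be used as long as we stay consistent.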


An Example 


An example is likely useful in explaining how to enforce
privacy through an automated mechanism. Samples often
help us orient ourselves to the task at hand. Naturally, in a
discussion about privacy, we will want a dataset with
individuals whose privacy we will want to protect. Still, we
also want to ensure these individuals are not
representative of anyone real. To achieve this, we have
downloaded a sample customer dataset from Sling
Academy, which gives us a list of individuals and some
personal information about them.


Customers Sample Data, Sling Academy
https://www.slingacademy.com/article/customers-sample-data-csv-json-xml-and-xlsx/


I've also cleaned the data to make it a little more suitable
for our purposes, primarily by adding some randomized
identifiers, introducing some nulls, and removing
extraneous characters.


The base data used includes 1000 rows from a sample customer set


We have 1000 rows representing data you may find in
sales, lending, or social welfare domains. Each row
represents one customer with a unique ID for each person,
a large random number we assign as a primary key. The
individual's social security number and name are present
to further identify them. We also have general contact
information (phone, email), demographic information
(gender, job), and some facts that are meaningful to the
business (number of sales and total sales dollars). Note
that gender uses the public toilet symbols for woman,
man, and neutral, as mandated by several US
States.


Solving The Mysteries Of The California Restroom Sign,
Certified Access Specialist Institute
https://casinstitute.org/article/solving-mysteries-california-restroom-sign


Looking more closely at some of these fields, we might
notice that the values inside our fields have different
associated frequencies. If we consider gender, we can see
that there are only 3 possible values and that, for the most
part, knowing about them does not tell us much about the
individual. For example, if I tell you that our subject is a
man, you only have a 1 in 508 chance of guessing who that
person is. If, on the other hand, I tell you the subject's first
name is "David", you have a much better chance of
identifying the subject (1 in 17 chance).


The probability of guessing an individual by their gender depends on
which gender is exposed


This ability to measure predictability brings us close to
Claude Shannon's definition of entropy, or surprise. Given
the probability of guessing the person, we can also
calculate the amount of surprise. By calculating the
probability of randomly selecting an individual from
within each category, we can get an idea of how private the
field is.


In our case, the chosen field, gender, has 3 categories. By
taking an average of their probabilities (0.0430), we get a
general sense of their level of privacy.
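The arithmetic behind that average can be sketched as follows. The category counts are an assumption: I picked numbers that are consistent with the figures quoted here (1 in 508 and an average of 0.0430), and the real counts in the sample dataset may differ.

```javascript
// Hypothetical category counts, chosen so the results line up with the
// figures quoted in the text; the real counts may differ.
const genderCounts = { man: 508, woman: 484, neutral: 8 };

// Chance of picking the right person when all we know is their category.
const guessProbabilities = Object.values(genderCounts).map((n) => 1 / n);

// Field-level summary: the plain (unweighted) average over the categories.
const averageProbability =
  guessProbabilities.reduce((sum, p) => sum + p, 0) / guessProbabilities.length;
const privacyFactor = 1 - averageProbability;
console.log(averageProbability.toFixed(4), privacyFactor.toFixed(4));
// 0.0430 0.9570
```

Because the average is unweighted, the rare category dominates: with these assumed counts, nearly all of the 0.0430 comes from the eight people in the smallest group. That is a useful property for a privacy measure, since the rare values are exactly the ones that expose people.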


We can repeat this for all fields, giving us a privacy profile
for the dataset.


Measures of privacy for each field as a probability, Shannon, or Hartley


The privacy factor calculated for each field matches our
intuition: perfectly unique IDs carry a very low privacy
factor, and items like "first_name" are relatively
anonymous. Having observed this, we can observe some
non-intuitive findings, such as the very low privacy
associated with "municipality". We also have a
cautionary reminder: emails and phone numbers are
unique to people.


// Calculate the privacy profile of all the fields.
// `fields` maps each field name to { total, values }, where
// `values` holds the frequency of each distinct value.
PrivacyProfile = (function(){
  let table = Object.entries(fields).map(d=>{
    let key = d[0];
    d = d[1];
    let total = d.total;

    let categories = Object.values(d.values).length;
    // average chance of NOT guessing the subject
    let privFactor = Object.values(d.values)
      .map(freq=>1-(1/freq))
      .reduce((a,d)=>a+d, 0)
      /categories;

    // average surprise in shannons (base 2)
    let shannon = Object.values(d.values)
      .map(freq=>Math.log(freq)/Math.LN2)
      .reduce((a,d)=>a+d, 0)
      /categories;

    // average surprise in hartleys (base 10)
    let hart = Object.values(d.values)
      .map(freq=>Math.log(freq)/Math.LN10)
      .reduce((a,d)=>a+d, 0)
      /categories;

    return {
      "Field": key,
      "PrivacyFactor": privFactor,
      "shannons": shannon,
      "hartleys": hart,
      "Probability": 1-privFactor,
      "Values": total,
      "categories": categories
    };
  });
  // index the rows by field name so they can be looked up directly
  let hashed = Object.fromEntries(table.map(d=>[d.Field, d]));
  return { hashed: hashed, table: table };
})();


Applying the Measure to whole
datasets


If our goal is to retain privacy while sharing data in a
large-scale environment, we can use this as a tool. Rather
than relying purely on people's subjective judgement, we
can aid their judgement with an objective measure. When a
data analyst comes to the Privacy Team looking for access
to sensitive data, we can measure the privacy level of their
data request.


We can calculate the net privacy of a request by taking a 
cumulative product of the privacy factors of the individual 
fields being requested. 


Intuitively, we can immediately see that the entire dataset
offers little to no privacy; however, this can be more
formally stated by taking the product of all the fields.


PrivacyProfile.table.reduce((a, d)=>{
  return a * d.PrivacyFactor;
}, 1);


With items like the SSN, some fields have 0 privacy, and
including these in the product results in a Net Privacy
Factor of "no privacy" (hard zero: 0).
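To see the hard zero in isolation, here is a self-contained sketch. The per-field factors are invented numbers, not values from the sample dataset, and `netPrivacy` is a hypothetical helper name.

```javascript
// Hypothetical per-field privacy factors; the numbers are invented.
const factors = { ssn: 0, email: 0, gender: 0.957, age: 0.93, hobbies: 0.96 };

// Net privacy of a request: the product of the factors of its fields.
const netPrivacy = (fieldNames) =>
  fieldNames.reduce((product, name) => product * factors[name], 1);

const everything = netPrivacy(Object.keys(factors));
const reduced = netPrivacy(["gender", "age", "hobbies"]);
console.log(everything, reduced.toFixed(3)); // 0 0.854
```

A single perfectly identifying field drags the whole request to zero, no matter how anonymous the other fields are. That is the behaviour we want from the measure.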


We should expect a Data Request to be only for the data
required for the analysis. Therefore, the request should
include a reduced set. To perform this calculation on
demand, we can create a generic function.


Sharing data with collaborators, Data Privacy
Handbook, Utrecht University
https://utrechtuniversity.github.io/dataprivacyhandbook/data-sharing-collaboration.html


function GenerateReqest(req = null) {
  // if nothing specific was requested,
  // make this a randomized selection
  if(!req){
    let fldprob = Math.ceil(
        PrivacyProfile.table.length * Math.random()
      ) / PrivacyProfile.table.length;

    req = PrivacyProfile.table
      .filter(d=>(Math.random() < fldprob))
      .map(d=>d.Field);
  }
  // select only the items that actually exist
  req = PrivacyProfile.table
    .filter(d=>req.includes(d.Field))
    .map(d=>d.Field);

  // prepare the return
  let rtn = { req: req };
  // calculate the privacy factor
  rtn.score = rtn.req.reduce((a,d)=>
    a*PrivacyProfile.hashed[d]['PrivacyFactor']
  , 1);
  // if the factor is so small as to require
  // scientific notation, let's just call it zero.
  // This convention is only in place to make reading
  // the results easier
  rtn.score = rtn.score.toString().includes('e') ?
    0 : rtn.score;
  // get the filtered dataset
  rtn.data = basedata.map(d=>{
    d = rtn.req.reduce((a,f)=>{ a[f] = d[f]; return a; }, {});
    // generate a random id for each record
    d = Object.assign({
      id: Math.floor(
        Math.random() * Number.MAX_SAFE_INTEGER
      ).toString(32)
    }, d);
    return d;
  });
  return rtn;
}


This generic function takes a list of variables the analyst
wants access to, calculates the cumulative score, and
produces the requested dataset. This can then be used to
make a request (R00001) for items we know are more
generic fields.


R00001 = GenerateReqest(['gender', 'age', 'hobbies'])
// {
//   req: ["gender", "age", "hobbies"],
//   score: 0.8534004245932628
// }


Again, the results match our expectations. We know that
gender, age and hobbies have a much higher Privacy
Factor and should, therefore, be much safer. However,
now that we have a clear measure, we can be more explicit
and state that the cumulative product of the request has a
Privacy Factor of 0.8534.


Another Data Analyst may make an innocent request for
data that surprises us.


R00002 = GenerateReqest(['state', 'municipality', 'registered']);
// {
//   req: ['state', 'municipality', 'registered'],
//   score: 0.006712926793361226
// }


Their intent is a longitudinal study of the organization's
successes and failures. Intuitively, a city should be
reasonably anonymous. Surprisingly (likely due to our
small data size), this request shows a significantly lower
Privacy Factor (0.0067).


The Privacy Team should examine this request much more
closely to determine if there is a problem and if any further
transforms can be applied to the dataset that would reduce
the risk profile of the supplied dataset.


Using the Thresholds in Automated 
Checks 


There is no point in having this tool if we can't use it in our
systems to make our lives easier. We have already
discussed using it to assist in evaluation, but applying it as
an automated check is also possible. By setting a
predefined threshold, we can perform an automated check
on our datasets to ensure that requests are immediately
rejected if they are below a predetermined value.


While no single threshold is suitable for all cases,
organizations can inspect sample requests to identify an
appropriate point for automatic approval, automatic
rejection, or calls for further inspection. These thresholds
can be applied at different points in the approval chain:


1. Creating the Request 
2. Privacy Approval 
3. Data generation 


When the data analyst begins to put their request together,
we can offer them an estimate of the privacy factor of their
request. As an inexpensive calculation, this can be done
live, during the request process, through a web or
application interface. This lets the customer get early
feedback about their requests, enabling them to plan their
justifications or mitigating transforms that may reduce
the risk profile. Assuming the Privacy Factor is above a
predetermined safe threshold, the request can be delivered
immediately with no further review.
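
A minimal sketch of the three outcomes described above (automatic approval, automatic rejection, or further inspection). The threshold values and the name `triage` are placeholders invented for illustration; any real cut-offs would come from inspecting sample requests.

```javascript
// Hypothetical cut-offs; every organization would tune its own.
const THRESHOLDS = { approve: 0.5, reject: 0.01 };

// Classify a request by its Privacy Factor (higher means safer).
const triage = (score, t = THRESHOLDS) => {
  if (score >= t.approve) return 'approve'; // deliver immediately
  if (score < t.reject) return 'reject';    // refuse immediately
  return 'review';                          // send to the Privacy Team
};

console.log(triage(0.85));   // approve
console.log(triage(0.2));    // review
console.log(triage(0.0067)); // reject
```

Because the calculation is cheap, the same function could run live in the request form and again at approval time.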


Upon submission of a request requiring further review,
Privacy Evaluators can have the privacy estimate available
to them as a holistic estimate of the risks involved. This
offers a tool to formulate their thoughts; it gives them a
metric that allows them to focus on problem areas. This
also has the benefit of reducing evaluator bias, where one
evaluator may personally have a different risk tolerance
than another evaluator. This metric can form the basis for
discussion among evaluators and with the customer to
focus discussion on real (rather than perceived) issues.


Lastly, we can continue evaluating results even as the
customer performs their transforms. It is possible for a
dataset to be approved for use, with the understanding
that the trusted analyst will perform aggregation that will
increase the Privacy Factor of the dataset, but how do we
know that they succeeded? If the calculation is performed
on a platform we control, we can use these same measures
to evaluate the produced dataset. This evaluation can even
go further, using a more complete evaluation of the
privacy, rather than the estimate we used to get this far.


If their privacy measures are unsuccessful and we control
the platform, we can issue an error message indicating
that the privacy threshold was unmet.
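
As a sketch of that last check, assuming the platform can call some scoring function on the dataset the analyst produced. `PrivacyThresholdError`, `releaseIfSafe` and the `scoreFn` parameter are hypothetical names, not part of any real platform.

```javascript
// Hypothetical error raised when a produced dataset is still too risky.
class PrivacyThresholdError extends Error {
  constructor(score, threshold) {
    super(`Privacy threshold unmet: ${score} is below ${threshold}`);
    this.name = 'PrivacyThresholdError';
    this.score = score;
    this.threshold = threshold;
  }
}

// scoreFn stands in for whatever privacy measure the platform uses.
const releaseIfSafe = (dataset, scoreFn, threshold) => {
  const score = scoreFn(dataset);
  if (score < threshold) throw new PrivacyThresholdError(score, threshold);
  return dataset; // safe enough to hand back to the analyst
};
```

The error carries the score and the threshold so the analyst can see how far off their transforms were.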


Judging the appropriate threshold for any given system 
will be contextual. Privacy analysts will need to evaluate 
where the cut-off should lie based on the shares’ openness 
and the nature of the data. Secure data managed in secure 
facilities will require fewer checks than data being 
published openly on the web. 


Conclusion 


The ability to measure data anonymity before any requests
for access offers several advantages in managing Data
Repositories. Claude Shannon's concept of information
entropy gives us just such a mechanism, which measures
the amount of entropy in a potential channel (shared
dataset).

Using this estimate, combined with an evaluation of
appropriate thresholds, it is possible to partially automate
the approval mechanism for data shares. While complete
automation is not possible, we can still reduce the burden
on Privacy experts and use these objective measures to
reduce the impact of any personal biases the experts may
have.
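
For reference, Shannon entropy for a single field can be sketched as below. This is only the textbook formula, H = -Σ p·log2(p), applied to the observed values; it is not a restatement of how the Privacy Factor in this chapter is derived, and the function name is my own.

```javascript
// Shannon entropy, in bits, of the values observed in one field.
const shannonEntropy = values => {
  const counts = new Map();
  values.forEach(v => counts.set(v, (counts.get(v) || 0) + 1));
  let entropy = 0;
  counts.forEach(count => {
    const p = count / values.length; // observed probability of this value
    entropy -= p * Math.log2(p);
  });
  return entropy;
};

console.log(shannonEntropy(['a', 'a', 'a', 'a'])); // 0 (every record looks the same)
console.log(shannonEntropy(['a', 'b', 'c', 'd'])); // 2 (every record is distinct)
```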


We all recognize the importance of maintaining people's 
privacy and securing assets in shared datasets. These 
processes and techniques are useful in helping to reduce 
costs through objective techniques. 


Going a Little Further 


While this was an interesting idea to pursue, there are so
many more layers that weren't covered:


• Privacy Weights: every field should have a
manually set weight associated with it.
Immediately, this can include values that look
unique but will be made available with other
protections. For example, a Unique ID should be
transformed for every request, so the privacy factor
is likely known beforehand but is not represented
by the actual data. Having some manually set
adjustments can compensate for that.


• Anonymity Functions: The initial request could
include several very simple and automatically
applicable functions. Rounding all dollar figures to
the nearest thousand or including only the first
three letters of a name will drastically change the
privacy factor.


• The Thing I Forgot: I'm sure you could make a
million little inclusions to a Data Request that
would allow for more refined automation and
control. Feel free to drop them in the comments.
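
The anonymity functions mentioned in the list above could be as small as the following sketch. The function names, the sample record, and the choice of rounding to the nearest thousand dollars are assumptions made for illustration.

```javascript
// Round a dollar figure to the nearest thousand.
const roundToThousand = amount => Math.round(amount / 1000) * 1000;

// Keep only the first three letters of a name.
const truncateName = name => String(name).slice(0, 3);

// Apply a map of field -> transform; fields without a transform pass through.
const anonymize = (record, transforms) =>
  Object.keys(record).reduce((out, field) => {
    const fn = transforms[field];
    out[field] = fn ? fn(record[field]) : record[field];
    return out;
  }, {});

const row = { name: 'Margaret', salary: 61499, hobbies: 'chess' };
console.log(anonymize(row, { name: truncateName, salary: roundToThousand }));
// { name: 'Mar', salary: 61000, hobbies: 'chess' }
```

Each transform merges many distinct values into one, which is why the privacy factor of the transformed field changes so much.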


Further Reading 
Some general articles that appeared in Google searches
while I was writing this looked interesting. I read them
later.


A Gentle Introduction to Information Entropy

A blog discussing various aspects of data privacy


While we discussed measuring entropy to ensure it was
sufficient, there are several things you can do to artificially
increase entropy when it is not sufficient.


Statistical approaches to de-identification,
Utrecht University: Discusses various
methods for de-identifying data:
K-anonymity, L-diversity, T-closeness, and
Differential privacy


I really found Utrecht's book on the matter interesting. As
I was looking for references to help make points, I kept
referring back to this book.

The Evils of Sequential
IDs


3 reasons not to use Sequential IDs as
Primary Keys

• Sequential IDs have several re-occurring risks that
are known

• We will discuss 3 risks associated with Sequential
IDs as Primary Keys

• Some valid cases for maintaining Sequential
numbers are discussed

2001 was an exciting time for me.


I had just made a significant career change, realizing
Nursing was not my thing; I had changed fields and
graduated with a CompSci degree, formalizing everything
I had tried to teach myself. The Y2K panic had happened,
and the DOTCOM bubble had just burst, flooding the field
with people with much more experience than I had and
making finding a job all the more challenging. 9/11
brought security into crystal focus for professionals
everywhere, information security being part of that. In
this context, I found my start with a small consulting
company that helped small businesses transition from
in-house software solutions to robust managed solutions
and further into online solutions available anywhere.


During one of these conversions, the owner of my
company brought up an article he had recently read, and a
debate ensued. According to the article, people should stop
using sequential IDs for record identifiers. This seemed
absurd to me: Sequence IDs are easy to create and easier to
read; it makes no difference what you use internally. He
was being absurd.


Unfortunately, he lost the debate. 


1. I have long since learned to be more open in my
discussions


2. He was right


Many years later, I was working on refactoring a website. I
knew there were several vulnerabilities, but time is a
precious resource, and I prioritized them as best I could.
That's when we got the report: a customer had noticed
that when you log in, you get logged into Company 5; he
assumed that if there is a 5, there must also be a 6... and he
gained access to data he had no business seeing.


If I had been more open to my boss's argument, I might
have seen how risky sequential IDs are and prioritized that
particular issue more highly.


This issue was not unique to the system I was working on.
Almost a decade later, I was in a team-building exercise:
an intro C# class. It was all fairly basic stuff, but one thing
stuck out... right in the Microsoft-branded textbook: the
instructor pointed out a recommendation to use UUIDs for
primary keys in your database.


The reason was simple: at some point, a PK will get
exposed and act as the starting point for people to derive
further information.


The very solution my boss had proposed to a problem I had
dismissed. This problem is so widespread that it was
necessary to discuss it in introductory programming
classes, and the issues with it are obvious to anyone who
has grown up in this modern data security and privacy
world!


So... why do we still need to be reminded? Why are we so
obsessed with putting numbers in order?


False sense of utility 


One argument often raised in favour of sequential IDs is
the familiarity of numbers that are in order. This is
expressed as:

• We have a sense of how many records there are

• It gives me a way to orient myself

• Numbers are easier to read


People are expressing emotional states: we are
pattern-seeking monkeys, and seeing ordinal growth
gives us a sense of comfort. As hard as it is, the data is not
there to satisfy our psychological comfort; instead, we are
experts who deal with the uncomfortable realities of data.


This puts us in a tough place. Humans are really good at
post hoc rationalizations. Any time we experience
discomfort, we will reach for the comfortable and justify it
to ourselves in any way we can. Post hoc rationalization is
a problem we must each individually struggle with. It's
been a while since I've read The Righteous Mind, but
Haidt always nails this point:


Haidt invokes an evolutionary hypothesis... Reason,
in this view, evolved to help us spin, not to help us
learn.


We cannot allow ourselves to use what is comfortable but
rather must become comfortable with the most appropriate
techniques. To do that, we need to take time to consider
the benefits.


The Righteous Mind, Jonathan Haidt
https://www.amazon.ca/Righteous-Mind-Divided-Politics-Religion/dp/0307455777

Post Hoc Rationalisation – Reasoning Our Intuition and
Changing Our Minds, Jonathan MS Pearce, 2013-11-14
https://skepticink.com/tippling/2013/11/14/post-hoc-rationalisation-reasoning-our-intuition-and-changing-our-minds/

Why Won't They Listen, NYTimes, William Saletan,
2012.03.28
https://www.nytimes.com/2012/03/25/books/review/the-righteous-mind-by-jonathan-haidt.html

If we use a sequence of numbers as a primary key, we use
arbitrary keys. Ergo, these are not for human
consumption, and therefore, they should not make your
reading easier; their purpose is to make the computer's
reading easier (and, thus, hopefully, our maintenance).


A critical touch-point in software development is the
avoidance of side effects.


When searching for the above link, Google gave me a
sidebar that stated:


Side effects are any changes in the state of the
program or the environment that are not reflected in
the function's output. Side effects can make the
program unpredictable, hard to test, and difficult to
debug.


We do not want to become dependent on side effects
because, by definition, more than one effect comes from a
single action. We cannot change the action to
accommodate one effect because that would impact the
other.


Using the Primary Key as a sequence? You can't change its
data type to something else if necessary because that
would make it non-sequential. Need to modify the order?
It is not possible; it has foreign key references.


Our job, as experts, is not to define data in ways that make
us comfortable; rather, it is our job to define data as it is.

define whatever it is we perceive – to trace its
outline – so we can see what it really is: its
substance. Stripped bare. As a whole. Unmodified.
And to call it by its name – the thing itself and its
components ... Nothing is so conducive to spiritual 
growth as this capacity for logical and accurate 
analysis of everything that happens to us. 


— Meditations 3.11" 


Risks

Security


The two main examples from my personal experience
draw on the security risk associated with sequential
numbers. Sequences of numbers give us a pattern that we
can use to predict the next or previous value.


Whether we meant to or not, by using sequential IDs, we 
have introduced information into our dataset. 


In exposed applications that control information, this has
been known to hint to people that more information is
available.


In Nova Scotia in 2018, an individual using a government
data transfer system noticed that the numbers were
relatively small when retrieving FOIP information releases
online. He surmised that the numbers were sequential and
tried typing the following number into his browser. He
was surprised when he got a FOIP record that did not
belong to him. Had non-sequential values been used, his
guess would likely have resulted in a miss, landing in the
large gaps between valid values.


Meditations, Emperor of Rome Marcus Aurelius
https://gutenberg.org/ebooks/2680

Concerns teen being 'railroaded' in privacy breach to
cover government slip, Jon Tattrie, CBC News, 2018-04-12
https://www.cbc.ca/news/canada/nova-scotia/concerns-teen-being-railroaded-in-privacy-breach-to-cover-government-slip-1.4616972


Clicking on the item's link brings the user to the
item's page and puts the URL for that item, including
its unique identification number, in the browser's
address bar. Changing just the number in the
address bar takes the user to a different item
directly.


– OIPC Investigation Report


While this type of risk is well known, it remains common,
but at least it is obvious. Unfortunately, there are far more
insidious ways in which data leaks occur.


Sequential IDs can be exploited statistically to gain good
estimates of total counts. This is known as the German
Tank Problem.


During WWII, it was necessary to identify how many tanks
Germany produced. Using the serial numbers off the
wheels of destroyed tanks, analysts could estimate the
total number of wheels (and, therefore, tanks) in
operation.


Office of the Information and Privacy Commissioner for
Nova Scotia, Investigation Report IR19-01, Department
of Internal Services, Freedom of Information Access (FOIA)
Website, Catherine Tully, Information and Privacy
Commissioner for Nova Scotia
https://oipc.novascotia.ca/sites/default/files/publications/OIPC%20Investigation%20Report%20IR19-01%20%2815%20Jan%202019%29.pdf

wheels, which were observed to be sequentially
numbered ... Analysis of wheels from two tanks 
yielded an estimate of 270 tanks produced in 
February 1944 ... German records after the war 
showed production for the month of February 1944 
was 276 


— German Tank Problem, Wikipedia 


Randomly selecting a sample and then looking at their
sequence numbers gives us an idea of how dense the
population is. We can estimate population (and
sub-population) size from a sample using one of several
functions.
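
One such function is the classic estimator for the German Tank Problem, N ≈ m + m/k - 1, where m is the largest serial number seen and k is the sample size. The sketch below assumes serials start at 1 and are sampled without repeats; the function name is my own.

```javascript
// Estimate the population size from a sample of sequential serial numbers.
// m = largest serial observed, k = number of serials observed.
const estimatePopulation = serials => {
  const k = serials.length;
  const m = Math.max(...serials);
  return m + m / k - 1;
};

console.log(estimatePopulation([19, 40, 42, 60])); // 74
```

Four leaked record IDs are enough to make a reasonable guess at how many records exist in total.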

When we share large datasets for analysis, we often
exclude records deemed to be outside the Data Scientist's
domain of interest (often for privacy). Unfortunately, if we
share underlying sequential values, we expose more
information about the more extensive set than we
intended. The customer can derive information about the
unshared portion of the data.


Distributed Systems


In the early 90s, the Open Software Foundation (OSF)
released the Distributed Computing Environment,
bringing large-scale computing systems to the fore.
Suddenly, individual analysts were not tied to their
desktops and could distribute their processing across
unlimited computers and processors. While less readily
available than modern Cloud services,
intra-organizational processing became much more
powerful.

One problem that arose from using distributed compute
systems is that it takes time for the processing units to
talk to each other. To optimize a distributed solution, the
processing units must be able to act independently of each
other as much as possible.


Unfortunately, sequential keys require coordination. At the
very least, a single computer must touch every record to
coordinate the generation of the next value.


This acts as a bottleneck in processing.


For the processing units to assign a coordinated value,
they must communicate with each other to serialize the
values. This means that either the sequential value must
be added to the records before distributing the data
(requiring the cost of producing a new dataset), or the
values must be assigned in blocks (effectively randomly
across partitions).


The goal is to get data onto the distributed processors with as little fuss
as possible.


While the OSF came up with an exciting solution to this
problem, random numbers and cryptographic hashes have
become more prevalent in the intervening decades. Using
huge random numbers, we can be reasonably assured that
the number generated will be unique to the population,
meaning there is no need to coordinate between systems.
Each processor can independently assign identifiers
without concerning itself with the work of other
processors in the system. The results can later be merged
as one step.
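
A small sketch of that independence, using Node's built-in `crypto.randomUUID()` (random version 4 UUIDs). The two "workers" here are just two function calls standing in for separate processors, and `assignIds` is a name made up for this example.

```javascript
const { randomUUID } = require('node:crypto');

// Each worker labels its own records with no shared counter.
const assignIds = records => records.map(r => ({ id: randomUUID(), ...r }));

const workerA = assignIds([{ name: 'a1' }, { name: 'a2' }]);
const workerB = assignIds([{ name: 'b1' }]);

// Merge the results later, as one step.
const merged = [...workerA, ...workerB];
console.log(merged);
```

Neither worker needed to know how many records the other had processed.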


Hashes serve a similar purpose but have the benefit of
being deterministic. It is possible to process a unit and
identify which unit it is without having to do expensive
scanning on the more extensive set. Suppose some data
feature can be used as a unique identifier; hashes can be
utilized to convert that feature into a reasonably sized and
well-distributed integer suitable for use as a key. Like
random numbers, we can be assured that there will not be
conflicting numbers, but we also have the benefit of
being able to integrate overlapping entities if they are
determined to represent the same thing.
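
As a hedged illustration, the sketch below derives a 128-bit integer key from a natural key by truncating a SHA-256 digest. The choice of SHA-256, the truncation to 128 bits, and the email address used as the natural key are all assumptions for the example.

```javascript
const { createHash } = require('node:crypto');

// Deterministic 128-bit key: the first 16 bytes (32 hex characters) of SHA-256.
const hashKey = naturalKey =>
  BigInt('0x' + createHash('sha256')
    .update(String(naturalKey))
    .digest('hex')
    .slice(0, 32));

const a = hashKey('jane.doe@example.com'); // seen by one processor
const b = hashKey('jane.doe@example.com'); // seen again by another
console.log(a === b);
```

Two processors that meet the same entity arrive at the same key without talking to each other, which is what makes merging overlapping records possible.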


Imbalanced Indexes 


The Law of Threes dictates that I make three subpoints, so...


There is an argument that sequential IDs do not balance
well in the binary tree during partitioning for storage,
processing, or indexing of datasets. This first came up
during a (now lost) lecture I watched by Damien Katz (I
think) of IBM regarding CouchDB. He discussed why
random keys were important to database performance and
asked audience members if their keys became imbalanced.
This highlighted issues I remember from Project
Gutenberg's data store, in which they partition based on
the first digit of the sequential number of the book
submission.


See chapter “De-duplicating Data Storage in Data
Science”


Under these conditions, Benford's Law is going to burn us.


Benford’s Law states that a number's first digit is often
low. In a sequential set, it is obvious why this would be
true: every time you change scales (say 1's to 10's), it takes
ten times longer to change the first digit (in other cases,
this remains a mystery to me).


For partitioning, the first partition will get 30% of the 
work, while partition 9 will only receive 5%, creating 
massive bottlenecks while one grinds away and another 
sits idle. 
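
To see the skew concretely, the sketch below gives the Benford share for each leading digit, log10(1 + 1/d), and counts how many of the sequential IDs 1..n would land in each first-digit partition. The exact split for a real sequence depends on how far it has grown; 2500 is an arbitrary example.

```javascript
// Expected share of leading digit d under Benford's Law.
const benford = d => Math.log10(1 + 1 / d);

// Count how many of the IDs 1..n fall into each first-digit partition.
const partitionCounts = n => {
  const counts = Array(10).fill(0);
  for (let id = 1; id <= n; id++) counts[Number(String(id)[0])]++;
  return counts.slice(1); // position 0 is digit 1, position 8 is digit 9
};

console.log(benford(1).toFixed(3), benford(9).toFixed(3)); // 0.301 0.046
console.log(partitionCounts(2500));
// [1111, 612, 111, 111, 111, 111, 111, 111, 111]
```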


While I have seen real-world examples of this and do
recognize the problem as a problem, I am generally
dismissive of it. Implementing a fast, balanced hashing
algorithm that resolves this problem is relatively simple.


Using the example from above, Benford's Law can be
circumvented by simply reversing the numbers
(effectively a Mod10 hash). So instead of sending the
sequence [101, 102, 103, 104, 105] to the same
partition, we are better off using [101, 201, 301, 401,
501] to distribute the values more evenly.
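
In code, the idea looks something like the following sketch (the helper names are mine). Partitioning on the last digit is the same as taking the ID modulo 10, and reversing the digits simply moves that fast-changing digit to the front.

```javascript
const firstDigit = id => Number(String(id)[0]);
const lastDigit = id => id % 10;
const reverseDigits = id => Number(String(id).split('').reverse().join(''));

const ids = [101, 102, 103, 104, 105];
console.log(ids.map(firstDigit));    // [1, 1, 1, 1, 1] -> one busy partition
console.log(ids.map(lastDigit));     // [1, 2, 3, 4, 5] -> spread across five
console.log(ids.map(reverseDigits)); // [101, 201, 301, 401, 501]
```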


Having come across articles on the hashing algorithms
used in indexing by some of the big database engines, I
expect data balancing during partitioning to be a
well-understood and handled problem in any platform I
adopt. If you have to balance your values, these may be
valid considerations, but looking at your chosen tools may
be in order.

Valid Uses


As with any absolute rule... well... there is no such thing as
an absolute rule. There are exceptions and valid reasons
for why you might want a sequential identifier.


Audits 


Going back to the very early days of my career, when my
colleagues and I were just trying to figure out how to turn
a dollar, a few of us took to teaching How to use Office
night classes at local community centres, job retraining
centres, or (mine) crime exit programs; all of us had a
module, Use Excel to Make an Invoice.


One weekend, I was at one of these friends' houses. We
were having a few beers and working on a hobby problem
together when he mentioned a lesson he had learned from
one of his students.


As part of his invoice class, he taught how to make invoice
numbers automatically (last inv number + 1), but he
told the class, “If you want to make your business look
more impressive to your customers, change that formula
to last inv number + 10. Even if you only have one
customer, it won't look that way.”


A hand near the back slowly went up.


"I am an accountant and have to recommend people not
do that. When you undergo a tax audit, they will identify
the gaps as missing invoices and assume you have done 10x
as much business as you have reported. They will estimate
what those invoices are worth and send you a bill for the
apparently missing nine invoices."

That's something you want to avoid, but the ability to
match patterns and find gaps, which poses the original
risk, also has benefits. The main reason I will reach for a
sequential ID is for audit purposes: a sequential ID can
verify missing data by creating gaps in the sequence when
data is removed.
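
The audit check itself is tiny. A minimal sketch, with `findGaps` and the invoice numbers invented for the example:

```javascript
// List the sequence numbers missing between the smallest and largest seen.
const findGaps = sequenceNumbers => {
  const seen = new Set(sequenceNumbers);
  const min = Math.min(...sequenceNumbers);
  const max = Math.max(...sequenceNumbers);
  const gaps = [];
  for (let n = min; n <= max; n++) {
    if (!seen.has(n)) gaps.push(n);
  }
  return gaps;
};

console.log(findGaps([1001, 1002, 1004, 1005, 1008])); // [1003, 1006, 1007]
```

This is the same pattern-matching that makes sequential IDs a risk, pointed at a purpose we actually want.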




Non-technical individuals often don't understand how
malleable information is in a digital world; it is easier
for them to comprehend that the number has incremented
than it is to understand a chain of hashes (Merkle Trees),
so adding a sequence ID for visiting auditors can be a
courtesy.


Replication Checkpoints 


One of the databases I like to use a lot is CouchDB, and one
of its super-powers is its ability to replicate the data
across multiple peers.


The Wayback Machine of The Internet Archive
demonstrates the malleability of web content. It is an
independent archive showing when websites have changed
their content. Often this is the only record that a change
has been made.
https://web.archive.org/

See Chapter “Paper as a Digital Storage Medium” for
ways to manage malleable digital information.

To achieve this, a few things must occur to address conflict
resolution, change histories, and split brains, but one of
the critical elements is the simplest: every change to the
database gets a sequence number.


This number is used during replication to act as a
bookmark, allowing one database to request every change
since a point in time.


This is an essential tool to be used during replication,
being able to state: give me everything since checkpoint X.
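
A toy version of the idea, to show why a globally incrementing number makes "everything since checkpoint X" trivial. This is not CouchDB's API; `ChangeLog`, `write` and `since` are names invented for the sketch.

```javascript
// A toy change log: every write gets the next sequence number.
class ChangeLog {
  constructor() {
    this.seq = 0;
    this.changes = [];
  }

  write(id, doc) {
    this.seq += 1;
    this.changes.push({ seq: this.seq, id, doc });
    return this.seq;
  }

  // "Give me everything since checkpoint X."
  since(checkpoint) {
    return this.changes.filter(c => c.seq > checkpoint);
  }
}

const source = new ChangeLog();
source.write('a', { v: 1 });
const checkpoint = source.write('b', { v: 1 }); // the replica is caught up to here
source.write('a', { v: 2 });

console.log(source.since(checkpoint).map(c => c.seq)); // [3]
```

The replica only has to remember one number between sessions.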


A similar construct is also available in MS SQL Server,
called a rowversion (it was a timestamp last I used it). I'm
sure most other DBs have the same concept: a globally
incrementing sequence.


The split-brain problem is one in which two copies of a
database are receiving updates but not able to
communicate with one another.
https://guide.couchdb.org/editions/1/en/conflicts.html

Replication, CouchDB Manual
https://guide.couchdb.org/editions/1/en/replication.html

Solutions


Given all of this, what should we do when constructing
identifiers?


Big Integers 


No matter what identifier style you use, make sure you use
big integers.


In those early days of web applications, our databases had
auto-incrementing indexes that could have been randomly
specified as the sequence. However, they were constrained
to a 32-bit integer, with only room for about 4 billion
records.


Not a lot by modern standards.


If you use random numbers, you only benefit from the 
gaps if they are reasonably large. If you use sequential 
values, you do not want to run out. 
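
To put rough numbers on "reasonably large", the sketch below prints the size of a few integer widths and uses the standard birthday approximation, p ≈ 1 - exp(-n² / 2^(b+1)), for the chance that n random b-bit values contain at least one collision. The function name and the sample sizes are my own choices.

```javascript
// How many identifiers fit in each width?
console.log(2n ** 32n);  // 4294967296n
console.log(2n ** 64n);  // 18446744073709551616n
console.log(2n ** 128n); // 340282366920938463463374607431768211456n

// Birthday approximation for n random values of the given bit width.
const collisionChance = (n, bits) =>
  -Math.expm1(-(n * n) / 2 ** (bits + 1));

console.log(collisionChance(100000, 32)); // roughly 0.69
console.log(collisionChance(1e9, 128));   // roughly 1.5e-21
```

A hundred thousand random 32-bit IDs are more likely than not to collide; a billion random 128-bit IDs effectively never will.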


Random values, Hashes, and UUIDs


Generally, I suggest using (at least) a 128-bit integer for
storage. I don't choose this number arbitrarily; yes, it's
large, but it is also the size of a UUID. This has the
convenience of having a ready-to-use storage type on
most systems and standardized functions for generating
random (or near-random) numbers of that size.


I say near random because while UUIDv4 is a random
number, other versions of UUID are not. For example,
version 1 depended on the computer's MAC address to
form part of its uniqueness. Further, some
implementations of UUID generators are not truly
standards conformant; instead, they include timestamps
in the number for convenience.


Hashes are another interesting option; while
deterministic, they are effectively random. If a dataset has
a natural key that we do not necessarily want to share but
that makes an obvious primary key, hashes can be a convenient
way to mask the natural key. It generates a large
(effectively) random integer. While an md5 will fit into a
128-bit integer, most modern hashes require more space.


One exciting benefit of using a hash occurs when you
require scrambling of your keys. Using a salt value
(perhaps a salt value per customer), we can allow for
simple regeneration of different random-like keys if we
ever require them to change.
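
A minimal sketch of per-customer salting, assuming HMAC-SHA-256 as the keyed hash and keeping the first 128 bits of the result. The salt strings and the name `maskedKey` are placeholders; real salts would be long random secrets.

```javascript
const { createHmac } = require('node:crypto');

// Derive a customer-specific key from an internal key and that customer's salt.
const maskedKey = (internalKey, salt) =>
  createHmac('sha256', salt)
    .update(String(internalKey))
    .digest('hex')
    .slice(0, 32); // 32 hex characters = 128 bits

console.log(maskedKey(42, 'salt-for-customer-a'));
console.log(maskedKey(42, 'salt-for-customer-b'));         // same record, different key
console.log(maskedKey(42, 'salt-for-customer-a-rotated')); // new salt, new keys
```

Two customers cannot join their extracts on the key, and changing a customer's salt regenerates every key they see.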


Primary Keys must never be 
sequential 


There are reasons to use sequential numbers, but making
one a primary lookup value creates the temptation to expose it
to the client, and the moment you have something that
looks tempting, it will happen. Let me say that again:


Your primary key will be exposed to the customer. 


There are all kinds of reasons we promise it won't, but at
some point, somebody will make a mistake and expose
that number. After all, it's the value we look at in
individual records. If for no other reason than an analyst
wants to do a join against two tables, that primary key is
going to get exposed.


By using random-like values for PKs, there is less
temptation to utilize them for inappropriate things. We can
only derive information from them indirectly and,
therefore, don't try to use them for secondary purposes. If
we need to add sequential values, put them in a field that
explicitly defines why they exist, clarifying their purpose
(and non-purpose).


Non-primary, Unique Identifiers 
Just keep an eye out for them.
Do you have a field called created_time?


If your data has a creation timestamp, it may have
something very close to a unique identifier. While this may
not be sensitive information, users can use it to join data
against other datasets.


While useful for diagnostic purposes, these low entropy
values pose a risk when exposed to the broader customer
group. We can simply remove these values from data
shares by not having them as the primary key. By having
these unique values not acting as primary keys, we are not
bound by their side effects and are less tempted to share
them inappropriately.


If these values are required for diagnostics, then we must
maintain them. Our only defence is to be aware of the risks
and keep the values unbound and hidden from the shared
datasets.


Conclusion 


While there are valid reasons for maintaining sequential
values in data, using sequential values as the primary key
poses significant risks to data. Given the nature of primary
identifiers, the cost of the risks versus the perceived
benefits does not justify their use.

Separating valid sequential use cases from the primary
identifier allows us to separate the concerns of the data's
purpose, allowing us to move parts independently of side
effects. Large integers have been widely available for over
30 years, allowing us to create significant gaps of negative
space in our data to evade malicious detection. Modern
random number generation and cryptographic hashes
offer a secure way to populate that large space.


Given these simple and well-established solutions, the
perceived benefits of sequential primary keys do not
outweigh the risks they pose to security and
performance.


As engineers and architects, we need to consider not how
we have always perceived data but rather the historical and
scientific context surrounding protecting that data. The
security mechanisms of the past no longer suit the
computational realities of the present.


1928 US patent 1,657.41: Enigma Cypher Machine

Conclusion


You're still here? 
Wow, that's really cool 


One of the obstacles to change in the office is being willing
to work to understand the system around us. People
quickly hang their brains on a coat hook at the door to get
through the day without struggling and hoping to avoid
conflict. If you've made it this far, you have demonstrated
something more than that.


Persistence, or what some might call
'Bloody-Mindedness', is the key to unlocking the
mysteries of the complex system around us. It's this
unwavering determination that will enable you to
comprehend the generic components of these systems,
and ultimately, to effect change.


When I embarked on this journey, I must admit, I didn't
anticipate much. But I've always harboured a deep passion
for change and a fervent desire to see the world transform
for the better. Your presence here, your commitment to
understanding and changing complex systems, has truly
surprised and inspired me. You've shown that you're the
kind of person who can make a tangible difference in this
world. And that, my friends, fills me with joy.


Whether you are a manager trying to gain insight into your
team’s challenges or a student looking at the issues you
will face, Information Systems are no longer external to
our lives and careers but are the business processes
themselves. I hope you gained a general perspective of the
world of information systems and how they integrate (or
don't integrate) directly into our lives.


If these chapters prepared you for the exciting challenges
and opportunities around you ... well ... mission
accomplished.


an 
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-A- 
Astrolabe: Tips For Teachers 


1. As an introduction, promise students that in the 
next 30 minutes they are going to build their first 
computer program. 

2. Don't have students cut out the circles. By keeping 
them on the page, you imply no rotation. This 
delays discovery long enough for you to suggest it 
as intentional thought. 

3. Students will have different tightness of their 
spirals. This is a hint as to the trade-off between 
high differentiation and more colours. 

4. There will be the student that is tentative about 
punching a hole in their paper. The hole needs to be 
big enough that the colour shows through. 
Encouraging the student to be a little rough with it 
gets a couple of chuckles. 

5. Have the students (as a group) describe how their 
computer meets the three criteria that were laid 
out. 

6. Ignore the aesthetics until the very end. Inevitably 
a student will bring up aesthetics. Act surprised, 
like you forgot about it. Then hold up your demo 
and rotate it. If you don't like the palette you got, 
turn it to select a different palette. 

7. Like any activity, leave this to the end of the class. 
The excitement caused by comparing colour 
schemes brings the class to an end. Expect to do no 
more than have students take a written handout on 
the way out the door. 

8. At the beginning, you drew a spiral. This was not 
strictly necessary. Really, you should do this using 
Radial Coordinate graphs (on paper), or a 
continuous formula. The act of counting across, 
and in, was the real spiral algorithm. 
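
For teachers who want to show the "continuous formula" version of the spiral on a screen, here is a minimal Python sketch. It is only one possible reading of the paper activity. It assumes that the angle around the wheel maps to hue and that colours become paler toward the centre. The function name `spiral_palette` and its parameters `turns` and `rotation_deg` are illustrative, not part of the original activity.

```python
import colorsys


def spiral_palette(n_holes, turns=1.0, rotation_deg=0.0):
    """Sample n_holes colours along a spiral laid over a colour wheel.

    Assumptions (not from the activity itself): hue is the angle around
    the wheel, saturation drops linearly toward the centre, and
    brightness stays at its maximum.
    """
    palette = []
    for i in range(n_holes):
        t = i / n_holes  # 0 at the rim, moving toward the centre
        # Counting "across": the angle advances as we move along the spiral.
        hue = (rotation_deg / 360.0 + turns * t) % 1.0
        # Stepping "in": colours get paler closer to the centre.
        saturation = 1.0 - 0.5 * t
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
        palette.append((round(r * 255), round(g * 255), round(b * 255)))
    return palette


if __name__ == "__main__":
    # Same holes, wheel rotated by 120 degrees: a different palette.
    print(spiral_palette(6))
    print(spiral_palette(6, rotation_deg=120))
```

Changing `turns` plays the role of a tighter or looser spiral, and changing `rotation_deg` plays the role of turning the wheel under the punched holes.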


Activity Plan 
Skills 


• Algorithmic Thinking 
• Problem definition 

Learning Objectives 
• Describe the use of colour in categorisation 
• Describe the components of a good colour palette 
• Apply the components of a good colour palette 

Audience 
• Grades 7 to Undergraduate 
• Introductory programming students 
• Introductory data analysis students 
Materials 


• Colour Wheel: make one, or print the templates 
• Radial graph: must be the same size as the colour wheel 
• Pencil: must be sharp 


Progression 


1. Learn the three components of a good colour 
palette (~5 min) 


2. Construct a colour selector (~5 min) 


3. Reflect 


1. Gather Supplies (pencil/printouts) 


2. All the way across, 1 over, step in. Repeat until 
full 


3. Punch Holes 


4. Select colours, record by colouring in grid 


Reflection/Assessment 


• What changes could you make to get even 
more colours? 

• What happens to differentiation as you get 
more colours? 

• What decision-making inputs can you control 
on the machine? 

• How can you change the colours if they aren't 
aesthetically pleasing? 
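
For a class that already programs, the first two questions can be explored with a few lines of Python. This is a rough sketch, not part of the original activity. It assumes "differentiation" can be approximated by the smallest hue gap, in degrees, between neighbouring colours on the wheel. The helper names `evenly_spaced` and `min_hue_gap` are hypothetical.

```python
def evenly_spaced(n, rotation_deg=0.0):
    """Hues (in degrees) for n holes spread evenly around the wheel."""
    return [(rotation_deg + i * 360 / n) % 360 for i in range(n)]


def min_hue_gap(hues_deg):
    """Smallest circular gap, in degrees, between neighbouring hues."""
    ordered = sorted(h % 360 for h in hues_deg)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(360 - ordered[-1] + ordered[0])  # gap across the 0/360 seam
    return min(gaps)


if __name__ == "__main__":
    for n in (4, 8, 16):
        # More colours means a smaller gap between neighbours.
        print(n, min_hue_gap(evenly_spaced(n)))
    # Rotating the wheel changes the colours but not the gap.
    print(min_hue_gap(evenly_spaced(4, rotation_deg=30)))
```

Under this assumption, doubling the number of colours halves the gap between neighbours, while rotating the wheel changes which colours appear without changing how far apart they are.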